This executable notebook will help you complete Pset 3.

If you haven't used Colab before, it's very similar to Jupyter / IPython / R Notebooks: cells containing Python code can be interactively run, and their outputs will be interpolated into this document. If you haven't used any such software before, we recommend [taking a quick tour of Colab](https://colab.research.google.com/notebooks/basic_features_overview.ipynb).

---

Now, a few Colab-specific things to note about execution before we get started:

- Google offers free compute (including GPU compute!) on this notebook, but *only for a limited time*. Your session will be automatically closed after 12 hours. That means you'll want to finish within 12 hours of starting, or make sure to save your intermediate work (see the next bullet).
- You can save and write files from this notebook, but they are *not guaranteed to persist*. For this reason, we'll mount a Google Drive account and write to that Drive when any files need to be kept permanently.
- You should keep this tab open until you're completely finished with the notebook. If you close the tab, your session will be marked as "Idle" and may be terminated.

# Getting started

**First**, make a copy of this notebook so you can make your own changes. Click *File -> Save a copy in Drive*.

### What you need to do

Read through this notebook and execute each cell in sequence, making modifications and adding code where necessary. You should execute all of the code as instructed, and make sure to write code or textual responses wherever the text **TODO** shows up in text and code cells.

When you're finished, choose *File -> Download .ipynb*. You will upload this `.ipynb` file as part of your submission.


# Set up

In [None]:
%pip install --upgrade transformers
%pip install tiktoken
%pip install flash-attn --no-build-isolation

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
GDRIVE_DIR = "/content/gdrive/My Drive/096222-pset-3"

In [None]:
exp_2_3_path = "gdrive/MyDrive/096222-project/exp2-3/Structured Task (sentence decoding)"

# Project Structured Task

## Project Structured Task - part 1

### 2) Word embeddings

The code and text below are for the second problem on the pset. Note that the second code chunk takes several minutes to run but only needs to be run once: it downloads the GloVe vectors and saves them on your Google Drive in a new folder named *096222-pset-3* (about 1GB for the glove.6B.zip dataset). When you are done with the pset, you may delete the files to free up space.

In [None]:
# # This code chunk needs to be run only the first time through the pset.
# # It downloads the GloVe word embeddings and saves them to your Google Drive.
# !time wget http://nlp.stanford.edu/data/glove.6B.zip
# !unzip glove.6B.zip
# !mkdir -p "$GDRIVE_DIR"
# !mv glove.6B.300d.txt "$GDRIVE_DIR/"

In [None]:
import numpy
from tqdm import tqdm

def read_vectors_from_file(filename):
    # Map each word to its embedding vector (one "word v1 v2 ..." entry per line)
    d = {}
    with open(filename, 'rt', encoding='utf-8') as infile:
        for line in tqdm(infile):
            word, *rest = line.split()
            d[word] = numpy.array(list(map(float, rest)))
    return d

e = read_vectors_from_file(GDRIVE_DIR + "/glove.6B.300d.txt")

#### Implement and test the cosine measure of word similarity.

### 3) Using semantic vectors to decode brain activation - word2vec

In [None]:
import numpy as np
## Write a function to compute the cosine similarity between two word vectors.
##       Demonstrate that it's symmetric with a few examples.
def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    norm_x = np.linalg.norm(x)
    norm_y = np.linalg.norm(y)
    # Cosine similarity is undefined for a zero vector; return 0.0 by convention
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return np.dot(x, y) / (norm_x * norm_y)

# Alternative: from sklearn.metrics.pairwise import cosine_similarity
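
The cell above asks for a demonstration that the measure is symmetric. A minimal sketch of such a check follows, using small made-up vectors rather than real embeddings; the toy values and the local re-definition of `cosine_similarity` are assumptions that only serve to keep the snippet self-contained.

```python
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    # Same definition as in the cell above, repeated so this snippet runs on its own
    norm_x = np.linalg.norm(x)
    norm_y = np.linalg.norm(y)
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return float(np.dot(x, y) / (norm_x * norm_y))

# Toy vectors (illustrative values, not real word embeddings)
a = np.array([1.0, 2.0, 3.0])
b = np.array([-2.0, 0.5, 4.0])
c = np.array([0.0, 0.0, 0.0])

for x, y in [(a, b), (a, c), (b, b)]:
    # Swapping the arguments should not change the result
    assert np.isclose(cosine_similarity(x, y), cosine_similarity(y, x))

sim_ab = cosine_similarity(a, b)
sim_bb = cosine_similarity(b, b)
print(sim_ab, sim_bb)
```

With real embeddings, the same loop could be run over pairs of vectors taken from the dictionary `e` loaded earlier.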

### Load the data

In [None]:
# Download and extract the data and learn_decoder.py
!wget --header="Host: drive.usercontent.google.com" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7" --header="Accept-Language: en-US,en;q=0.9,he;q=0.8" --header="Cookie: HSID=AHJfxja1o67aaDDKP; SSID=AcFaYUEeiC88MwrF9; APISID=-FXvHmBvJ828Jrpq/AaIp_RI6gKwBAA-zy; SAPISID=_psqReiv0O2WdiVv/AhLpZThJtVNAPgAJP; __Secure-1PAPISID=_psqReiv0O2WdiVv/AhLpZThJtVNAPgAJP; __Secure-3PAPISID=_psqReiv0O2WdiVv/AhLpZThJtVNAPgAJP; S=billing-ui-v3=pX9aAWC8SzxQZfQvQ-0SbCFRz65PPkVY:billing-ui-v3-efe=pX9aAWC8SzxQZfQvQ-0SbCFRz65PPkVY:maestro=dsv3G-owxPD6uTATLH0lBQZNadhFo6ZKJiuB9usoQVU; __Secure-BUCKET=CPgG; SID=g.a000kggtmVDh8D92rqHe5fiG-bMoXQw7Ld8Tf_C8qHhSE2ZoFUyx_uObP_F4bCqI8I561ccGMwACgYKAWESARQSFQHGX2Mi5DnhBiJ2gjjbMSP0XJbU5BoVAUF8yKrlBjWMdNOfGnmA7TZzmbWD0076; __Secure-1PSID=g.a000kggtmVDh8D92rqHe5fiG-bMoXQw7Ld8Tf_C8qHhSE2ZoFUyx2BFINS8lXhFUyAFwuvl8CQACgYKAW4SARQSFQHGX2MiWd6bHkI0JN89-1dFZUbS2hoVAUF8yKpc-H3AD8N6tj-dmFG21SeE0076; __Secure-3PSID=g.a000kggtmVDh8D92rqHe5fiG-bMoXQw7Ld8Tf_C8qHhSE2ZoFUyxJl_TGsCsjeiVN72q3lSCWQACgYKASASARQSFQHGX2MiULluXa7aABDwxgCWjB6IyhoVAUF8yKoy_HHYLCqIFMwNjx-GwYWe0076; __Secure-ENID=20.SE=jyM_w2hA8DW6FvPOh9wudde93a0A9P41Epzo098LV_LyU79-VVcJ9K-vNLrhCLuVzi69CyV4RxlSls8AAT9J8odwIXi_ISVn8Z1U1DH52BC3YiwOwO9LKUsBesCbGx2D6u1XwZ5GIP_PZMo1tkLLJq2VCtcxRP9OtC_QgHNbAD4eyc1TTu1C8XbZLFTOIgb0k9IfM2bMBXeha6t3sJysARZWpDIzs3I8wWZ5JtABB253grtjQyCnxyy9MUgTcYAVaoEGwgVHV4V4lSY6gydFkO2gYxl7JqYloqCq74HahGK54TBlsGZIOTM_KvFAsIidcrPaVOBpH6IGQTPChxy3Tr-GLK7VpBiQ8JW7V0xC8XTN1crEaaZnGFQ6MrjDv8f3hCY0Kg; AEC=AQTF6HwEtUB747fVHMzvOWJV9pmRoGs8Ix8FJ1HTrxbE9NY1dtyro2AvNQ; 
NID=515=Wdt0NWZqVSh3TtdIfjXCGTCCkj7jaJjt-lkOL3hLD_hPSSMyGxKkVthECwGGFbbxmvfM2iKZ1SkPGDVgLwjghAOrV9Ya7iEJJ0eSXZSfszc0WxRXm3Jy6LxqPEZLmY8v3AIkMX-o8KE5ZRXGEzgv_s9pfgS8bmeiIGT13Iiyw9tPzRZDChGijNbZ0Mp1oF-4YKikOZCyo8Km9wXOgLAC9dbeIqAlTdER97cQ7B5GajyRLH_bFrg0lCVN4tyZEycjihHOu6Eq_V88rswgV7uvzemJ_yk4WbbIWJVm9NCO4tWdDQG8NY3EY57xAJbmIhu260jDftYwzjCnpqJ8C1iCm-FjboF6xJwKJEtLkCXagIcSWxfPGqRWIn5KY72ogAMZTlUZ5RE5F8bH4sFgkt5pW_AalY5mxYPOfZgF-9hcJYsF71rMOic6mqSfvR8iNo-k6_SZ-4o5WkYxbwdLgiaIOiCJHkhIGBoXsm5hh5BHDqlk5ERGnFn5zpqOguNLJFjXT3nhaP1g_a2fFvd0bmZw2A9Y6tBNAC7CbDOmSHSmYmLag0qVcqu286CZh5svuhdM-QPcSCt5u0kPgfWN3KBha0G9L9qCiDIwntvnlVNoUYLBM4je1bhGjO9M2tdH_vteLo4vjm9Cq-4I2A; __Secure-1PSIDTS=sidts-CjIB3EgAEi6AJoaJlu_IOdqmparuSFUne3RqD5YKK5hcqKjRlc0CTp9lSpyH2OoVVoqmlxAA; __Secure-3PSIDTS=sidts-CjIB3EgAEi6AJoaJlu_IOdqmparuSFUne3RqD5YKK5hcqKjRlc0CTp9lSpyH2OoVVoqmlxAA; SIDCC=AKEyXzW7IJ8miV8hX_pqzqqPW7--eMWuWfausspLBoDPlfZDCRZDED56ohpancLYOBPizfUzi1aM; __Secure-1PSIDCC=AKEyXzVGnmSuG07J22njRVFPQ_sk88MgnqtYxJd-M0_9Pz3jdh4GpGPhPOqCMrayTU9SJTW3n54; __Secure-3PSIDCC=AKEyXzVScJYbKdtIImPYKpTRkExsc5UhC5n9Rkk8wNFlMZNW3_xkvWlimAXWaZ4T7kTcJy5AE4I" --header="Connection: keep-alive" "https://drive.usercontent.google.com/download?id=1xZaorRH-xxjfochvSesAhOTUg82_Xq56&export=download&authuser=0&confirm=t&uuid=efeb9ce5-a5c5-453b-938d-6c0ece963f3c&at=APZUnTV18b5mSao0MQ2JbtpefTxr%3A1719665236172" -c -O 'files.zip'
!unzip files.zip
!rm files.zip

In [None]:
#Let's load the functions from learn_decoder.py
from learn_decoder import *

#and the data
data = read_matrix("imaging_data.csv", sep=",")
vectors = read_matrix("vectors_180concepts.GV42B300.txt", sep=" ")
concepts = np.genfromtxt('stimuli_180concepts.txt', dtype=np.dtype('U')) #The names of the 180 concepts

In [None]:
data.shape

In [None]:
# from gensim.models import Word2Vec

# word2vec = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)


In [None]:
!kaggle datasets download -d leadbest/googlenewsvectorsnegative300
!unzip googlenewsvectorsnegative300.zip

In [None]:
from gensim.models import KeyedVectors
w2v = KeyedVectors.load_word2vec_format ('GoogleNews-vectors-negative300.bin.gz', binary=True)

In [None]:
for c in concepts:
   if c not in w2v:
     print(c)

In [None]:
mask = concepts == 'argumentatively'
data = data[~mask]
vectors = np.vstack([w2v[c] for c in concepts[~mask]])
concepts = concepts[~mask]

### What are the Accuracy scores?
Define a function that computes a rank-based accuracy score, then iterate over the 18 folds. For each fold, train the decoder **using the `learn_decoder` function** (the function is already imported from `learn_decoder.py`) on the fold's training data, obtain the predictions on the fold's test data, and store both the accuracy score of each concept (use the labels from `concepts`) and the average score of the fold's 10 concepts.
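
To make the rank-based score concrete, here is a minimal sketch on toy data. The helper name `rank_of_true_concept` and the 2-dimensional vectors are assumptions for illustration; the idea is to sort all candidate vectors by cosine similarity to the decoded vector and report the 1-based position of the true concept.

```python
import numpy as np

def rank_of_true_concept(predicted: np.ndarray, candidates: np.ndarray, true_idx: int) -> int:
    """Return the 1-based rank of candidates[true_idx] by cosine similarity to predicted."""
    unit_candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    unit_predicted = predicted / np.linalg.norm(predicted)
    similarities = unit_candidates @ unit_predicted
    order = np.argsort(-similarities, kind="stable")  # most similar candidate first
    return int(np.where(order == true_idx)[0][0]) + 1

# Toy candidate set of 4 "concept" vectors (illustrative values only)
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]])
predicted = np.array([0.9, 0.1])  # a decoded vector close to candidate 0

rank_first = rank_of_true_concept(predicted, candidates, true_idx=0)
rank_opposite = rank_of_true_concept(predicted, candidates, true_idx=2)
print(rank_first, rank_opposite)
```

A rank of 1 means the decoded vector is closest to the true concept; larger ranks mean more candidates were judged more similar than the true one.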

In [None]:
import numpy as np

# Define cosine similarity function
def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    norm_x = np.linalg.norm(x)
    norm_y = np.linalg.norm(y)
    return np.dot(x, y) / (norm_x * norm_y)

# Return the 1-based rank of the true concept among all candidate vectors
# (1 = the prediction is most similar to the true concept; lower is better)
def get_accuracy(true_idx, v, vectors):
    similarities = {row: cosine_similarity(v, vectors[row, :]) for row in range(vectors.shape[0])}
    ranked_rows = sorted(similarities, key=similarities.get, reverse=True)
    return ranked_rows.index(true_idx) + 1

start_row = 0
end_row = 10
idx_score = {}      # fold index -> average rank over the fold's test concepts
concept_score = []  # (concept, rank) pairs

for fold in range(18):
    test_vector_matrix = vectors[start_row:end_row]
    test_data_matrix = data[start_row:end_row]

    test_list = [j for j in range(start_row, end_row) if j != 179]
    train_list = [i for i in range(179) if i not in test_list]

    m = learn_decoder(data[train_list, :], vectors[train_list, :])
    start_row += 10
    end_row += 10

    prediction = np.dot(test_data_matrix, m)
    score_list = []
    for row, true_index in enumerate(test_list):
        accuracy = get_accuracy(true_index, prediction[row], vectors)
        concept_score.append((concepts[true_index], accuracy))
        score_list.append(accuracy)

    idx_score[fold] = np.sum(score_list) / len(score_list)

print(idx_score)

In [None]:
#Now let's plot the averaged accuracy score for each fold

import matplotlib.pyplot as plt


In [None]:
#TODO
# Plot the average rank for each fold, cycling through a pastel color palette
x_values = list(idx_score.keys())
y_values = list(idx_score.values())

colors = ['#FFB3BA', '#FFDFBA', '#FFFFBA', '#BAFFC9', '#BAE1FF']

plt.bar(x_values, y_values, color=colors[:len(x_values)])
plt.xlabel('Fold')
plt.ylabel('Average rank')
plt.xticks(range(0, 18))
plt.title("Average rank for each fold")
plt.show()


### Which concepts can be decoded with more or less success?

In [None]:
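#TODO
# Identify which concepts can be decoded with more or less success
# (a lower rank means the true concept was closer to the top of the candidate list)
concept_score.sort(reverse=True, key=lambda x: x[1])
print("Best decoded (lowest ranks):", sorted(concept_score, key=lambda x: x[1])[:7])
print("Worst decoded (highest ranks):", concept_score[:7])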


### Explanation of results

As we can see, the best seven words in terms of rank accuracy appear to be more associatively unambiguous than the worst seven words. For example, when we think about food, we don’t have a specific dish in mind but a sensation of hunger or food. In contrast, when we think about a movie, we usually have many elements associated with it, such as genre, favorite actress, etc. These associations can activate areas of voxels and lead to a measurement of activity that resembles other concepts.

Additionally, some of the higher-ranked words seem to be quite neutral in terms of feeling. For example, the word “do” doesn’t bring out an emotional response, whereas words like “deceive,” “argumentatively,” and “cockroach” have negative connotations, which may result in a spike in neural activity and lead to an inaccurate measurement.

Although the groups of words may appear random, we can identify several similarities as presented above. It’s important to note that not all the words behave in the same way regarding these similarities or differences. Furthermore, the data we obtained is based on one subject, and as a result, the model might be biased towards the subject’s feelings and associations. For instance, if the subject is a movie critic, they may immediately think about various aspects of a movie when the concept is mentioned. A different subject might have fewer competing associations, making the movie concept easier to recognize.



### Are the results satisfactory, in your opinion? Why or why not?

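In [None]:
# Part 1: Average the fold scores after dropping the outlier fold
# (keep only folds whose average rank is below 90; the outlier scored about 105)
under_90 = [score for score in idx_score.values() if score < 90]
print(np.sum(under_90) / len(under_90))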

# Part 2: Calculate the number of values below 90 and their sum
c = 0
sum_val = 0
for item in concept_score:
    if item[1] < 90:
        c += 1
        sum_val += item[1]

# Display the results
print(c)
print(f'The percentage of concepts which achieved better results than random classifier: {np.round(c/180, 3)}')
print(f'The average of concepts which achieved better results than random classifier: {np.round(sum_val/c, 3)}')


### Explanation

**Are the results satisfactory, in your opinion? Why or why not?**

After analyzing the data, we can see that most of the concepts achieved an average rank below 90, indicating that they were decoded better than a random classifier would manage. However, there were some concepts that significantly outperformed others, with an average rank of 36-40, while others ranged from 70 to 80.

- Fraction of concepts that achieved better results than a random classifier: 0.744
- Average rank of the concepts that achieved better results than a random classifier: 37.948

Our perspective is both optimistic and realistic. Looking at the folds, most achieved an average rank below 90, except for one fold that scored 105, so we are satisfied that the model beats random classification. In a real-world setting, however, we would expect the model to give accurate results for the majority of examples. Unfortunately, the average rank of the folds that scored below 90 is around 60, and the average rank of the individual concepts that scored below 90 is around 38. Consequently, the model delivers satisfying results (30-40) for only a small portion of the folds, while the remaining folds yield mediocre scores (approximately 65-80).

To summarize, our satisfaction with the model depends on the assigned task. If the objective is to build a classifier superior to random classification, the model meets our expectations. However, if the objective is to develop a reliable classifier for real-world usage, the model falls short of our expectations.

### Old Noam

In [None]:
# mask = concepts == 'argumentatively'
# data = data[~mask]
# vectors = np.vstack([w2v[c] for c in concepts[~mask]])
# concepts = concepts[~mask]

In [None]:
concepts[0]

In [None]:
vectors[0].shape

In [None]:
data[0]

In [None]:
type(data)

You can verify for yourself what `learn_decoder` consists of by going to Files and opening it.
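
As a rough picture of what a linear decoder of this kind does, here is a minimal ridge-regression sketch on synthetic data. It is not the contents of `learn_decoder.py`: the function name `ridge_decoder`, the fixed `alpha`, and the toy shapes are all assumptions made for illustration.

```python
import numpy as np

def ridge_decoder(data: np.ndarray, vectors: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Solve min_M ||data @ M - vectors||^2 + alpha * ||M||^2 in closed form."""
    n_features = data.shape[1]
    gram = data.T @ data + alpha * np.eye(n_features)
    return np.linalg.solve(gram, data.T @ vectors)

rng = np.random.default_rng(0)
n_samples, n_voxels, n_dims = 50, 20, 5   # toy sizes, far smaller than real fMRI data
true_map = rng.normal(size=(n_voxels, n_dims))
data = rng.normal(size=(n_samples, n_voxels))
vectors = data @ true_map + 0.01 * rng.normal(size=(n_samples, n_dims))

m = ridge_decoder(data, vectors, alpha=1e-3)
decoded = np.dot(data, m)                 # same prediction step used in this notebook
print(m.shape, np.abs(decoded - vectors).max())
```

The returned matrix has one row per voxel and one column per embedding dimension, so `np.dot(data, m)` maps brain images to semantic vectors.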

#### What are the Accuracy scores?

Define a function that computes a rank-based accuracy score, then iterate over the 18 folds. For each fold, train the decoder **using the `learn_decoder` function** (the function is already imported from `learn_decoder.py`) on the fold's training data, obtain the predictions on the fold's test data, and store both the accuracy score of each concept (use the labels from `concepts`) and the average score of the fold's 10 concepts.

In [None]:
from tqdm import tqdm

def average_rank(m, data, vectors):
  decoded_data = np.dot(data, m)
  # NOTE: with matrix inputs, the cosine_similarity defined earlier returns the
  # dot-product matrix divided by a single constant, so each row is effectively
  # ranked by dot product. Entry [i, j] compares true vector i with decoded vector j.
  similarity_matrix = cosine_similarity(vectors, decoded_data.T)
  ranks = []
  for i in range(vectors.shape[0]):
    sorted_similarity_v = np.argsort(-similarity_matrix[i])
    # 1-based position of the matching decoded vector among this fold's candidates
    rank = np.where(sorted_similarity_v == i)[0].item() + 1
    ranks.append(rank)
  return np.mean(ranks)

folds_count = 18
fold_size = data.shape[0] // folds_count

folds_idxs = [(i*fold_size, (i+1)*fold_size) for i in range(folds_count)]
data_folds = []
vectors_folds = []
for idx in folds_idxs:
  data_folds.append(data[idx[0]:idx[1]])
  vectors_folds.append(vectors[idx[0]:idx[1]])

models = {}
for i in tqdm(range(folds_count)):
  fold_train_data = np.concatenate([data_folds[j] for j in range(folds_count) if j != i])
  fold_train_vectors = np.concatenate([vectors_folds[j] for j in range(folds_count) if j != i])
  fold_test_data = data_folds[i]
  fold_test_vectors = vectors_folds[i]
  m = learn_decoder(fold_train_data, fold_train_vectors)
  models[i] = (m, average_rank(m, fold_test_data, fold_test_vectors))

Now let's plot the averaged accuracy score for each fold  

In [None]:
import matplotlib.pyplot as plt
print('mean average rank = ', np.mean([x[1] for x in models.values()]))
print('std average rank = ', np.std([x[1] for x in models.values()]))
plt.bar(list(map(str, models.keys())), list(map(lambda x: x[1], models.values())))
plt.ylabel("Average Rank")
plt.xlabel("Fold")
plt.title("Average Rank per Fold")
plt.show()


### Which concepts can be decoded with more or less success?

In [None]:
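#TODO
# NOTE: models stores one average rank per fold, so only fold-level comparisons
# are available here; list the concepts of the best and worst folds.
fold_ranks = [models[i][1] for i in range(folds_count)]
best_fold, worst_fold = int(np.argmin(fold_ranks)), int(np.argmax(fold_ranks))
best_concepts = concepts[folds_idxs[best_fold][0]:folds_idxs[best_fold][1]].tolist()
worst_concepts = concepts[folds_idxs[worst_fold][0]:folds_idxs[worst_fold][1]].tolist()
print("Concepts in the best fold:", best_concepts)
print("Concepts in the worst fold:", worst_concepts)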

### Are the results satisfactory, in your opinion? Why or why not?

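The results look good at first glance, with a mean average rank of 3.7 and a standard deviation of 0.66. However, the test set in each fold contains only 10 samples, and each concept is ranked only among those 10 candidates, so chance level here is 5.5 rather than the roughly 90 expected when ranking among all 180 concepts. With such a small test set, the results are not meaningful.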
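
To get a feel for how noisy a 10-sample fold is, the sketch below simulates chance-level folds. It assumes each concept is ranked only among the 10 held-out concepts of its fold, as in the fold loop above, and that a chance-level decoder assigns ranks uniformly at random; the variable names and trial count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_candidates = 10   # concepts ranked against each other within one fold
n_trials = 20_000   # number of simulated chance-level folds

# Each simulated fold assigns a uniformly random rank (1..10) to each of its 10 test concepts
chance_ranks = rng.integers(1, n_candidates + 1, size=(n_trials, n_candidates))
fold_means = chance_ranks.mean(axis=1)

print(fold_means.mean())  # close to (10 + 1) / 2 = 5.5
print(fold_means.std())   # spread of a single fold's average rank under pure chance
```

Comparing an observed fold average against this spread gives a quick sense of whether it is distinguishable from guessing.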

## Project Structured Task - part 2


The three analyses presented in the paper share a common goal of decoding brain activity to understand how semantic information is represented in the brain. All three experiments utilized linguistic stimuli, applied ridge regression for decoding, and evaluated performance using pairwise classification and rank accuracy metrics. The experiments drew on an overlapping pool of participants, which keeps the analyses comparable.

However, there are notable differences between the analyses. Experiment 1 focused on decoding single words presented in three different contexts (sentence, picture, and word cloud), resulting in the highest accuracy due to the simpler task. Experiment 2 extended the task to decoding full sentences derived from Wikipedia-style passages, introducing increased complexity by requiring the model to generalize from individual word meanings to sentence-level meanings. Experiment 3 further tested the robustness of the decoder by applying it to a new set of sentences, including narrative structures, which added another layer of complexity and resulted in slightly lower accuracy. The number of stimuli and participants also varied across the experiments, with Experiment 1 involving 180 words, Experiment 2 using 384 sentences, and Experiment 3 using 243 sentences, reflecting the increasing complexity and diversity of the tasks.
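
Pairwise classification, mentioned above, can be sketched as follows. This is one common formulation, written here with cosine similarity as an assumption (the paper's exact similarity measure may differ): for every pair of items, the pair counts as correct if matching each decoded vector to its own true vector scores higher than the swapped assignment. The function name `pairwise_accuracy` and the toy vectors are illustrative.

```python
import numpy as np
from itertools import combinations

def pairwise_accuracy(decoded: np.ndarray, true: np.ndarray) -> float:
    """Fraction of item pairs for which the matched assignment beats the swapped one."""
    unit_decoded = decoded / np.linalg.norm(decoded, axis=1, keepdims=True)
    unit_true = true / np.linalg.norm(true, axis=1, keepdims=True)
    sim = unit_decoded @ unit_true.T          # sim[i, j] = cosine(decoded_i, true_j)
    pairs = list(combinations(range(len(true)), 2))
    correct = sum(sim[i, i] + sim[j, j] > sim[i, j] + sim[j, i] for i, j in pairs)
    return float(correct / len(pairs))

# Toy "semantic vectors": six orthogonal unit vectors (illustrative only)
true = np.eye(6)
good_decoded = true + 0.1                     # slightly distorted but still well matched
shifted_decoded = np.roll(true, 1, axis=0)    # every decoded vector matches the wrong item

acc_good = pairwise_accuracy(good_decoded, true)
acc_shifted = pairwise_accuracy(shifted_decoded, true)
print(acc_good, acc_shifted)
```

Chance level for this measure is 0.5, since a random decoder is equally likely to favor the matched or the swapped assignment.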

## Project Structured Task - part 3

In [None]:
# Minimal wrapper that gives the learned decoder matrix a predict() method
class Decoder:
  def __init__(self, m):
    self.m = m
  def predict(self, data):
    decoded_data = np.dot(data, self.m)
    return decoded_data

decoder = Decoder(models[0][0])

In [None]:
exp_2_3_path = "gdrive/MyDrive/096222-project/exp2-3/Structured Task (sentence decoding)"

In [None]:
import numpy as np
import pickle
from sklearn.metrics.pairwise import cosine_similarity

# Load the decoder (Assume it's a trained model from Experiment 1)
# Example: from joblib import load
# decoder = load('decoder_model.joblib')

# Function to calculate rank accuracy
# NOTE: this averages the reciprocal of each sentence's 1-based rank (mean reciprocal
# rank), so 1.0 means every true vector was the top match and values near 0 are poor.
def calculate_rank_accuracy(decoded_vectors, true_vectors):
    similarities = cosine_similarity(decoded_vectors, true_vectors)
    ranks = np.argsort(-similarities, axis=1)  # Sort in descending order
    correct_ranks = np.array([np.where(ranks[i] == i)[0][0] + 1 for i in range(len(ranks))])
    rank_accuracy = np.mean(1 / correct_ranks)
    return rank_accuracy

# Load and evaluate on Experiment 2
with open(f'{exp_2_3_path}/EXP2.pkl', 'rb') as f:
    exp2_data = pickle.load(f)

# Extract the necessary data
exp2_fmri_data = exp2_data['Fmridata'][:, :170712]
exp2_true_vectors = np.loadtxt(f'{exp_2_3_path}/vectors_384sentences.GV42B300.average.txt')

# Use the decoder to predict semantic vectors for Experiment 2
exp2_decoded_vectors = decoder.predict(exp2_fmri_data)

# Calculate rank accuracy for Experiment 2
exp2_rank_accuracy = calculate_rank_accuracy(exp2_decoded_vectors, exp2_true_vectors)
print(f"Rank Accuracy for Experiment 2: {exp2_rank_accuracy:.4f}")

# Load and evaluate on Experiment 3
with open(f'{exp_2_3_path}/EXP3.pkl', 'rb') as f:
    exp3_data = pickle.load(f)

# Extract the necessary data
exp3_fmri_data = exp3_data['Fmridata'][:, :170712]
exp3_true_vectors = np.loadtxt(f'{exp_2_3_path}/vectors_243sentences.GV42B300.average.txt')

# Use the decoder to predict semantic vectors for Experiment 3
exp3_decoded_vectors = decoder.predict(exp3_fmri_data)

# Calculate rank accuracy for Experiment 3
exp3_rank_accuracy = calculate_rank_accuracy(exp3_decoded_vectors, exp3_true_vectors)
print(f"Rank Accuracy for Experiment 3: {exp3_rank_accuracy:.4f}")
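
The score computed above is an average of reciprocal ranks. For comparison, the sketch below contrasts it with a normalized rank accuracy, assumed here to be 1 - (rank - 1) / (N - 1), which has a chance level of 0.5 regardless of the number of candidates N. The function names and the toy ranks are illustrative and not part of the provided code.

```python
import numpy as np

def reciprocal_rank_score(ranks: np.ndarray) -> float:
    # Average of 1 / rank, as in calculate_rank_accuracy above
    return float(np.mean(1.0 / ranks))

def normalized_rank_accuracy(ranks: np.ndarray, n_candidates: int) -> float:
    # 1.0 when the true item is ranked first, 0.0 when it is ranked last
    return float(np.mean(1.0 - (ranks - 1) / (n_candidates - 1)))

# Toy 1-based ranks of the correct sentence among 5 candidates (made-up values)
ranks = np.array([1, 2, 5, 3])
mrr = reciprocal_rank_score(ranks)
nra = normalized_rank_accuracy(ranks, n_candidates=5)
print(mrr, nra)
```

The reciprocal form rewards top-1 hits heavily and decays quickly, while the normalized form changes linearly with rank, so the two can order experiments differently.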


In [None]:
exp3_data['labelsPassageForEachSentence'][92]

## Project Structured Task - part 4

In [None]:
# exp2_data['keyPassages']

In [None]:
import pandas as pd
pd.Series(exp2_data['labelsPassageCategory'][:, 0]).unique()

In [None]:
list(exp2_data.keys())

In [None]:
from collections import defaultdict

# Function to calculate rank accuracy per topic and passage
def calculate_rank_accuracy_per_topic(decoded_vectors, true_vectors, labels_passage, labels_topic):
    topic_accuracy = defaultdict(list)
    passage_accuracy = defaultdict(list)

    similarities = cosine_similarity(decoded_vectors, true_vectors)
    ranks = np.argsort(-similarities, axis=1)  # Sort in descending order

    for i in range(len(ranks)):
        correct_rank = np.where(ranks[i] == i)[0][0] + 1
        rank_acc = 1 / correct_rank
        topic = labels_topic[labels_passage[i].item() - 1].item()
        passage_accuracy[labels_passage[i].item()].append(rank_acc)
        topic_accuracy[topic].append(rank_acc)

    # Average rank accuracy for each passage and topic
    avg_topic_accuracy = {topic: np.mean(acc) for topic, acc in topic_accuracy.items()}
    avg_passage_accuracy = {passage: np.mean(acc) for passage, acc in passage_accuracy.items()}

    return avg_topic_accuracy, avg_passage_accuracy

# Analyze Experiment 2
exp2_avg_topic_accuracy, exp2_avg_passage_accuracy = calculate_rank_accuracy_per_topic(
    exp2_decoded_vectors, exp2_true_vectors, exp2_data['labelsPassageForEachSentence'], exp2_data['labelsPassageCategory']
)

print("Experiment 2 - Average Rank Accuracy per Topic:")
for topic, acc in exp2_avg_topic_accuracy.items():
    print(f"Topic {topic}: {acc:.4f}")

print("\nExperiment 2 - Average Rank Accuracy per Passage:")
for passage, acc in exp2_avg_passage_accuracy.items():
    print(f"Passage {passage}: {acc:.4f}")

# Analyze Experiment 3
exp3_avg_topic_accuracy, exp3_avg_passage_accuracy = calculate_rank_accuracy_per_topic(
    exp3_decoded_vectors, exp3_true_vectors, exp3_data['labelsPassageForEachSentence'], exp3_data['labelsPassageCategory']
)

print("\nExperiment 3 - Average Rank Accuracy per Topic:")
for topic, acc in exp3_avg_topic_accuracy.items():
    print(f"Topic {topic}: {acc:.4f}")

print("\nExperiment 3 - Average Rank Accuracy per Passage:")
for passage, acc in exp3_avg_passage_accuracy.items():
    print(f"Passage {passage}: {acc:.4f}")
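
The grouping logic above relies on passage ids being 1-based, so the topic of passage `p` is read from row `p - 1`. A minimal sketch of that lookup on made-up labels (all array names and values here are hypothetical stand-ins for the real label arrays):

```python
import numpy as np
from collections import defaultdict

# Toy stand-ins: 6 sentences, 3 passages, 2 topics (all values made up)
sentence_scores = np.array([1.0, 0.5, 0.25, 1.0, 0.2, 0.1])
passage_of_sentence = np.array([1, 1, 2, 2, 3, 3])   # 1-based passage id per sentence
topic_of_passage = np.array([1, 1, 2])               # row p - 1 holds the topic of passage p

scores_by_topic = defaultdict(list)
for score, passage in zip(sentence_scores, passage_of_sentence):
    topic = int(topic_of_passage[passage - 1])       # shift the 1-based id to a 0-based row
    scores_by_topic[topic].append(float(score))

avg_by_topic = {topic: float(np.mean(vals)) for topic, vals in scores_by_topic.items()}
print(avg_by_topic)
```

Topics with few sentences average over very few scores, so their per-topic values are correspondingly noisier.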


In [None]:
import matplotlib.pyplot as plt

# Function to plot bar graphs for rank accuracy
def plot_rank_accuracy_bar(accuracy_dict, title, xlabel):
    items = sorted(accuracy_dict.items(), key=lambda x: x[1], reverse=True)
    labels, accuracies = zip(*items)

    plt.figure(figsize=(10, 6))
    plt.barh(labels, accuracies, color='skyblue')
    plt.xlabel(xlabel)
    plt.title(title)
    plt.gca().invert_yaxis()  # Highest accuracy at the top
    plt.show()

In [None]:
# Experiment 2 (plot_rank_accuracy_bar already calls plt.show())
plot_rank_accuracy_bar(exp2_avg_topic_accuracy, "Experiment 2 - Rank Accuracy per Topic", "Rank Accuracy")
plot_rank_accuracy_bar(exp2_avg_passage_accuracy, "Experiment 2 - Rank Accuracy per Passage", "Rank Accuracy")

In [None]:
# Experiment 3 (plot_rank_accuracy_bar already calls plt.show())
plot_rank_accuracy_bar(exp3_avg_topic_accuracy, "Experiment 3 - Rank Accuracy per Topic", "Rank Accuracy")
plot_rank_accuracy_bar(exp3_avg_passage_accuracy, "Experiment 3 - Rank Accuracy per Passage", "Rank Accuracy")

In [None]:
# Function to print top and bottom N items in a dictionary
def print_top_bottom_items(accuracy_dict, title, n=10):
    sorted_items = sorted(accuracy_dict.items(), key=lambda x: x[1], reverse=True)

    print(f"\nTop {n} {title}:")
    for item, acc in sorted_items[:n]:
        print(f"{item}: {acc:.4f}")

    print(f"\nBottom {n} {title}:")
    for item, acc in sorted_items[-n:]:
        print(f"{item}: {acc:.4f}")

In [None]:
print_top_bottom_items(exp2_avg_topic_accuracy, "Topics in Experiment 2", n=10)
print_top_bottom_items(exp2_avg_passage_accuracy, "Passages in Experiment 2", n=10)

In [None]:
print_top_bottom_items(exp3_avg_topic_accuracy, "Topics in Experiment 3", n=10)
print_top_bottom_items(exp3_avg_passage_accuracy, "Passages in Experiment 3", n=10)

# Project Semi-Structured Task

## Project Semi-Structured Task - part 1

In [None]:
# Load the data for Experiment 2
import pickle
import numpy as np

with open(f'{exp_2_3_path}/EXP2.pkl', 'rb') as f:
    exp2_data = pickle.load(f)

exp2_fmri_data = exp2_data['Fmridata']
exp2_true_vectors = np.loadtxt(f'{exp_2_3_path}/vectors_384sentences.GV42B300.average.txt')

In [None]:
import numpy as np
import pickle
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from tqdm import tqdm


# Load GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

# Function to extract sentence representations from GPT-2
def extract_gpt2_representations(sentences):
    gpt2_representations = []
    for sentence in tqdm(sentences):
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)
        outputs = model(**inputs)
        sentence_representation = outputs.last_hidden_state.mean(dim=1).detach().numpy()
        gpt2_representations.append(sentence_representation.flatten())
    return np.array(gpt2_representations)

exp2_fmri_data = exp2_data['Fmridata']
exp2_sentences = [sent[0].item() for sent in exp2_data['keySentences']]

# Load the provided sentence representations
# exp2_provided_vectors = np.loadtxt('vectors_384sentences.GV42B300.average.txt')

# Extract GPT-2 sentence representations
exp2_gpt2_vectors = extract_gpt2_representations(exp2_sentences)

# Combine the provided representations with GPT-2 representations
exp2_combined_vectors = np.hstack((exp2_true_vectors, exp2_gpt2_vectors))

# Train a ridge regression model using the combined vectors
X_train, X_test, y_train, y_test = train_test_split(exp2_combined_vectors, exp2_fmri_data, test_size=0.2, random_state=42)
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)

# Evaluate the model
y_pred = ridge_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R^2 Score for the new encoder: {r2:.4f}")


In [None]:
import numpy as np
import pickle
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from tqdm import tqdm

# Load GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

# Function to extract sentence representations from GPT-2
def extract_gpt2_representations(sentences):
    gpt2_representations = []
    for sentence in tqdm(sentences):
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)
        outputs = model(**inputs)
        sentence_representation = outputs.last_hidden_state.mean(dim=1).detach().numpy()
        gpt2_representations.append(sentence_representation.flatten())
    return np.array(gpt2_representations)


exp2_fmri_data = exp2_data['Fmridata']
exp2_sentences = [sent[0].item() for sent in exp2_data['keySentences']]

# Load the provided sentence representations
exp2_provided_vectors = exp2_true_vectors

# Extract GPT-2 sentence representations
exp2_gpt2_vectors = extract_gpt2_representations(exp2_sentences)

# Combine the provided representations with GPT-2 representations
exp2_combined_vectors = np.hstack((exp2_provided_vectors, exp2_gpt2_vectors))

# Train and evaluate using only the provided representations
X_train_provided, X_test_provided, y_train_provided, y_test_provided = train_test_split(
    exp2_provided_vectors, exp2_fmri_data, test_size=0.2, random_state=42)
ridge_model_provided = Ridge(alpha=1.0)
ridge_model_provided.fit(X_train_provided, y_train_provided)

# Prediction and R^2 score for provided representations
y_pred_provided = ridge_model_provided.predict(X_test_provided)
r2_provided = r2_score(y_test_provided, y_pred_provided)
print(f"R^2 Score using provided sentence representations: {r2_provided:.4f}")

# Train and evaluate using the combined representations
X_train_combined, X_test_combined, y_train_combined, y_test_combined = train_test_split(
    exp2_combined_vectors, exp2_fmri_data, test_size=0.2, random_state=42)
ridge_model_combined = Ridge(alpha=1.0)
ridge_model_combined.fit(X_train_combined, y_train_combined)

# Prediction and R^2 score for combined representations
y_pred_combined = ridge_model_combined.predict(X_test_combined)
r2_combined = r2_score(y_test_combined, y_pred_combined)
print(f"R^2 Score using combined GPT-2 and provided representations: {r2_combined:.4f}")

# Compare the results
if r2_combined > r2_provided:
    print("The combined GPT-2 and provided representations performed better.")
else:
    print("The provided sentence representations alone performed better.")
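
The extraction function above encodes one sentence at a time, so averaging over the token axis only ever sees real tokens. If sentences were instead batched with padding (an assumption; the notebook does not do this), padded positions would need to be excluded from the mean. A minimal NumPy sketch of that masked pooling, with a hypothetical helper name and toy arrays standing in for model outputs:

```python
import numpy as np

def masked_mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padded positions.

    hidden_states: (batch, tokens, dim); attention_mask: (batch, tokens), 1 = real token.
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)   # avoid division by zero
    return summed / counts

# Toy batch: 2 sentences, 3 token slots, 2-dimensional "hidden states"
hidden_states = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],   # third slot is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden_states, attention_mask)
print(pooled)
```

Without the mask, the padding vector in the first sentence would dominate its average.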


In [None]:
import numpy as np
import pickle
from transformers import BertTokenizer, BertModel
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from tqdm import tqdm

# Load BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to extract sentence representations from BERT
def extract_bert_representations(sentences):
    bert_representations = []
    for sentence in tqdm(sentences):
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)
        outputs = model(**inputs)
        sentence_representation = outputs.last_hidden_state.mean(dim=1).detach().numpy()
        bert_representations.append(sentence_representation.flatten())
    return np.array(bert_representations)

# Load the data for Experiment 2
with open(f'{exp_2_3_path}/EXP2.pkl', 'rb') as f:
    exp2_data = pickle.load(f)

exp2_fmri_data = exp2_data['Fmridata']
exp2_sentences = [sent[0].item() for sent in exp2_data['keySentences']]

# Load the provided sentence representations
exp2_provided_vectors = np.loadtxt(f'{exp_2_3_path}/vectors_384sentences.GV42B300.average.txt')

# Extract BERT sentence representations
exp2_bert_vectors = extract_bert_representations(exp2_sentences)

# Combine the provided representations with BERT representations
exp2_combined_vectors = np.hstack((exp2_provided_vectors, exp2_bert_vectors))

# Train and evaluate using only the provided representations
X_train_provided, X_test_provided, y_train_provided, y_test_provided = train_test_split(
    exp2_provided_vectors, exp2_fmri_data, test_size=0.2, random_state=42)
ridge_model_provided = Ridge(alpha=1.0)
ridge_model_provided.fit(X_train_provided, y_train_provided)

# Prediction and R^2 score for provided representations
y_pred_provided = ridge_model_provided.predict(X_test_provided)
r2_provided = r2_score(y_test_provided, y_pred_provided)
print(f"R^2 Score using provided sentence representations: {r2_provided:.4f}")

# Train and evaluate using the combined representations
X_train_combined, X_test_combined, y_train_combined, y_test_combined = train_test_split(
    exp2_combined_vectors, exp2_fmri_data, test_size=0.2, random_state=42)
ridge_model_combined = Ridge(alpha=1.0)
ridge_model_combined.fit(X_train_combined, y_train_combined)

# Prediction and R^2 score for combined representations
y_pred_combined = ridge_model_combined.predict(X_test_combined)
r2_combined = r2_score(y_test_combined, y_pred_combined)
print(f"R^2 Score using combined BERT and provided representations: {r2_combined:.4f}")

# Compare the results
if r2_combined > r2_provided:
    print("The combined BERT and provided representations performed better.")
else:
    print("The provided sentence representations alone performed better.")

In [None]:
# from scipy.stats import permutation_test
# # Perform a permutation test to calculate p-value
# def permutation_test_statistic(y_true, y_pred1, y_pred2, n_resamples=10000):
#     observed_diff = r2_score(y_true, y_pred2) - r2_score(y_true, y_pred1)
#     count = 0
#     combined_predictions = np.concatenate((y_pred1, y_pred2))
#     for _ in tqdm(range(n_resamples)):
#         np.random.shuffle(combined_predictions)
#         perm_pred1 = combined_predictions[:len(y_pred1)]
#         perm_pred2 = combined_predictions[len(y_pred1):]
#         perm_diff = r2_score(y_true, perm_pred2) - r2_score(y_true, perm_pred1)
#         if perm_diff >= observed_diff:
#             count += 1
#     p_value = count / n_resamples
#     return p_value

# p_value = permutation_test_statistic(y_test_provided, y_pred_provided, y_pred_combined)
# print(f"P-value: {p_value:.4f}")
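
The commented-out cell above outlines a permutation test for the difference in R² between the two models. One common variant is sketched here, assuming a paired design in which the two models' predictions are randomly swapped for each test sentence. The helper name `paired_permutation_pvalue` and its parameters are hypothetical and not part of the assignment.

```python
import numpy as np
from sklearn.metrics import r2_score


def paired_permutation_pvalue(y_true, y_pred_a, y_pred_b, n_resamples=1000, seed=0):
    """One-sided p-value for R2(b) > R2(a) under random per-sample swaps (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    observed = r2_score(y_true, y_pred_b) - r2_score(y_true, y_pred_a)
    n_samples = y_true.shape[0]
    count = 0
    for _ in range(n_resamples):
        # Decide independently for every sample whether the two predictions trade places.
        swap = rng.random(n_samples) < 0.5
        # Reshape so the per-sample mask broadcasts over any voxel dimension.
        mask = swap.reshape((-1,) + (1,) * (y_pred_a.ndim - 1))
        perm_a = np.where(mask, y_pred_b, y_pred_a)
        perm_b = np.where(mask, y_pred_a, y_pred_b)
        perm_diff = r2_score(y_true, perm_b) - r2_score(y_true, perm_a)
        if perm_diff >= observed:
            count += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (count + 1) / (n_resamples + 1)
```

With the variables from the cell above, a call might look like `paired_permutation_pvalue(y_test_provided, y_pred_provided, y_pred_combined)`. This works because both splits use the same `random_state` and therefore the same test sentences.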

## Project Semi-Structured Task - part 2 - TODO

In [None]:
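from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Sanity check: fit a single voxel (index 2) on all sentences and score it in-sample.
# Named voxel_model so it does not overwrite the global BERT `model` used by
# extract_bert_representations.
voxel_model = LinearRegression()
voxel_model.fit(exp2_provided_vectors, exp2_fmri_data[:, 2])
predictions = voxel_model.predict(exp2_provided_vectors)
r2 = r2_score(exp2_fmri_data[:, 2], predictions)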

In [None]:
import pandas as pd
pd.Series(exp2_fmri_data[:, 2]).describe()

In [None]:
import pandas as pd
pd.Series(predictions).describe()

In [None]:
import numpy as np
import pickle
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt
from scipy.stats import ttest_1samp
from joblib import Parallel, delayed
from tqdm import tqdm

# Load the data for Experiment 2
with open(f'{exp_2_3_path}/EXP2.pkl', 'rb') as f:
    exp2_data = pickle.load(f)

exp2_fmri_data = exp2_data['Fmridata']  # Neural data: 384 sentences x 185866 voxels
exp2_sentences = [sent[0].item() for sent in exp2_data['keySentences']]  # Sentences

# Load the provided non-contextualized sentence representations
exp2_provided_vectors = np.loadtxt(f'{exp_2_3_path}/vectors_384sentences.GV42B300.average.txt')

# Initialize KFold with n_splits, ensure the same split is used across all voxels
n_splits = 5
kf = KFold(n_splits=n_splits)

# Function to fit a linear regression model for a single voxel with cross-validation
def fit_single_voxel(X, y, train_test_splits):
    r2_scores = []

    for train_index, test_index in train_test_splits:
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        model = LinearRegression()
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        r2 = r2_score(y_test, predictions)
        r2_scores.append(r2)

    mean_r2 = np.mean(r2_scores)

    # Test significance using one-sample t-test (null hypothesis: R² = 0)
    t_stat, p_value = ttest_1samp(r2_scores, 0)
    significant = p_value < 0.05

    return mean_r2, significant, p_value

# Function to fit linear regression models for each voxel in parallel with tqdm
def fit_voxel_models_parallel(X, Y, train_test_splits, n_jobs=-1):
    results = Parallel(n_jobs=n_jobs)(
        delayed(fit_single_voxel)(X, Y[:, i], train_test_splits) for i in tqdm(range(Y.shape[1]), desc="Fitting voxels")
    )

    r2_scores, significance_flags, p_values = zip(*results)
    r2_scores = np.array(r2_scores)
    p_values = np.array(p_values)
    significant_voxels = np.sum(significance_flags)

    return r2_scores, significant_voxels, p_values

# Generate the train-test splits once and use for all voxels
train_test_splits = list(kf.split(exp2_provided_vectors))
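
The per-voxel loop above fits one `LinearRegression` per column of `Y`. Ordinary least squares treats each output column independently, so the same fold-wise R² values can in principle be obtained from a single multi-output fit per fold. The sketch below assumes that equivalence and runs on synthetic data; `foldwise_voxel_r2` and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold


def foldwise_voxel_r2(X, Y, splits):
    """Return an array of shape (n_folds, n_voxels) with one R² per fold and voxel."""
    scores = []
    for train_idx, test_idx in splits:
        # A single fit handles every voxel (column of Y) at once.
        model = LinearRegression().fit(X[train_idx], Y[train_idx])
        preds = model.predict(X[test_idx])
        # multioutput="raw_values" returns one R² per column instead of the average.
        scores.append(r2_score(Y[test_idx], preds, multioutput="raw_values"))
    return np.vstack(scores)


# Small synthetic stand-in for (sentence vectors, voxel responses).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
Y_demo = X_demo @ rng.normal(size=(5, 3)) + 0.1 * rng.normal(size=(60, 3))
splits_demo = list(KFold(n_splits=5).split(X_demo))

fold_r2 = foldwise_voxel_r2(X_demo, Y_demo, splits_demo)
mean_r2_per_voxel = fold_r2.mean(axis=0)
```

On the full voxel matrix the coefficient and prediction arrays become large, so memory use would need checking before swapping this in for the parallel loop.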

In [None]:
# Run analysis using non-contextualized vectors
r2_scores_noncontextualized, significant_voxels_noncontextualized, p_values_noncontextualized = fit_voxel_models_parallel(exp2_provided_vectors, exp2_fmri_data, train_test_splits)
print(f"Number of significant voxels (non-contextualized): {significant_voxels_noncontextualized}")

In [None]:
r2_scores_noncontextualized

In [None]:
# Check how many CPU cores are available to decide how much to parallelize the next cell
import os
os.cpu_count()

In [None]:
# Extract BERT contextualized sentence representations
bert_vectors = extract_bert_representations(exp2_sentences)
if len(bert_vectors) == 4:  # Handle the notebook being run out of order
    bert_vectors = bert_vectors[0]
# Run analysis using BERT contextualized vectors
r2_scores_contextualized, significant_voxels_contextualized, p_values_contextualized = fit_voxel_models_parallel(bert_vectors, exp2_fmri_data, train_test_splits, n_jobs=7)
print(f"Number of significant voxels (contextualized): {significant_voxels_contextualized}")

In [None]:
plt.hist(r2_scores_noncontextualized, bins=50, alpha=0.5, label='Non-contextualized')
plt.xlabel('R² Score')
plt.ylabel('Number of Voxels')
plt.title('R² Score Distribution for Non-contextualized Embeddings')
plt.legend()
plt.show()

In [None]:
plt.hist(r2_scores_contextualized, bins=50, alpha=0.5, label='Contextualized (BERT)')
plt.xlabel('R² Score')
plt.ylabel('Number of Voxels')
plt.title('R² Score Distribution for Contextualized (BERT) Embeddings')
plt.legend()
plt.show()

In [None]:
# Compare R² distributions
plt.figure(figsize=(12, 6))
plt.hist(r2_scores_noncontextualized, bins=50, alpha=0.5, label='Non-contextualized')
plt.hist(r2_scores_contextualized, bins=50, alpha=0.5, label='Contextualized (BERT)')
plt.xlabel('R² Score')
plt.ylabel('Number of Voxels')
plt.title('R² Score Distribution for Non-contextualized vs. Contextualized Embeddings')
plt.legend()
plt.show()

# Plot significance comparison
plt.figure(figsize=(12, 6))
plt.hist(p_values_noncontextualized, bins=50, alpha=0.5, label='Non-contextualized')
plt.hist(p_values_contextualized, bins=50, alpha=0.5, label='Contextualized (BERT)')
plt.xlabel('P-value')
plt.ylabel('Number of Voxels')
plt.title('Significance of Voxel Predictions')
plt.legend()
plt.show()

# Open-ended Task

In [None]:
import pickle
import numpy as np
import matplotlib.pyplot as plt
import torch
import transformers
from tqdm import tqdm
from transformers import BertTokenizer, BertModel, GPT2Tokenizer, GPT2Model, AutoTokenizer, AutoModel, LlamaForCausalLM, LlamaTokenizer, AutoModelForCausalLM
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from scipy.stats import pearsonr

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# List of models to test: Llama 2, Mistral, Gemma, Phi, BERT, and GPT-2.
# The last field is only echoed in the progress message; the hidden size used in the
# analysis is read from model.config.hidden_size.
models_to_test = [
    ('meta-llama/Llama-2-7b-chat-hf', AutoTokenizer, AutoModelForCausalLM, 5120), # Llama 2 7B chat
    ('mistralai/Mistral-7B-Instruct-v0.1', AutoTokenizer, AutoModelForCausalLM, 5120), # Mistral 7B Instruct v0.1
    ('google/gemma-1.1-2b-it', AutoTokenizer, AutoModelForCausalLM, 5120), # Gemma 1.1 2B
    ('google/gemma-1.1-7b-it', AutoTokenizer, AutoModelForCausalLM, 5120), # Gemma 1.1 7B
    ('google/gemma-2-2b', AutoTokenizer, AutoModelForCausalLM, 5120), # Gemma 2 2B
    ('google/gemma-2-9b', AutoTokenizer, AutoModelForCausalLM, 5120), # Gemma 2 9B
    ('microsoft/phi-1_5', AutoTokenizer, AutoModelForCausalLM, 5120), # Phi-1.5
    ('microsoft/phi-2', AutoTokenizer, AutoModelForCausalLM, 5120), # Phi-2
    ('microsoft/Phi-3.5-mini-instruct', AutoTokenizer, AutoModelForCausalLM, 5120), # Phi-3.5 mini instruct
    ('bert-base-uncased', BertTokenizer, BertModel, 768),
    ('bert-large-uncased', BertTokenizer, BertModel, 1024),
    ('gpt2', GPT2Tokenizer, GPT2Model, 768),
    ('gpt2-medium', GPT2Tokenizer, GPT2Model, 1024),
]

# Function to extract sentence representations from a model
def extract_model_representations(model_name, tokenizer_class, model_class, sentences):
    import os  # local import so this cell stays self-contained
    torch.cuda.empty_cache()
    # Read the Hugging Face access token from the environment instead of hard-coding it
    hf_token = os.environ.get("HF_TOKEN")
    tokenizer = tokenizer_class.from_pretrained(model_name, trust_remote_code=True, token=hf_token)
    model = model_class.from_pretrained(model_name, trust_remote_code=True, token=hf_token).to(device)
    # Model size in MB, assuming 4 bytes (float32) per parameter
    model_size = sum(p.numel() for p in model.parameters()) * 4 / (1024 ** 2)
    hidden_size = model.config.hidden_size
    vocab_size = tokenizer.vocab_size
    model_representations = []
    with torch.no_grad():
      for sentence in sentences:
          inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512).to(device)
          outputs = model(**inputs, output_hidden_states=True)
          if isinstance(outputs, transformers.modeling_outputs.CausalLMOutputWithPast):
            sentence_representation = outputs.hidden_states[-1].mean(dim=1).detach().cpu().numpy()
          else:
            sentence_representation = outputs.last_hidden_state.mean(dim=1).detach().cpu().numpy()
          model_representations.append(sentence_representation.flatten())
    del model
    del tokenizer
    torch.cuda.empty_cache()
    return np.array(model_representations), model_size, vocab_size, hidden_size

# Load the data for Experiment 2
with open(f'{exp_2_3_path}/EXP2.pkl', 'rb') as f:
    exp2_data = pickle.load(f)

exp2_fmri_data = exp2_data['Fmridata']
exp2_sentences = [sent[0].item() for sent in exp2_data['keySentences']]

# Load the provided sentence representations
exp2_provided_vectors = np.loadtxt(f'{exp_2_3_path}/vectors_384sentences.GV42B300.average.txt')

# Store results for correlation analysis
model_sizes = []
r2_scores = []
model_names = []
model_meta = {}
# Test each model
for model_name, tokenizer_class, model_class, embedding_size in tqdm(models_to_test):
    print(f"Testing {model_name} with embedding size {embedding_size}...")

    # Extract sentence representations
    exp2_model_vectors, model_size, vocab_size, hidden_size = extract_model_representations(model_name, tokenizer_class, model_class, exp2_sentences)

    # Combine the provided representations with the model representations
    exp2_combined_vectors = np.hstack((exp2_provided_vectors, exp2_model_vectors))

    # Train and evaluate using the combined representations
    X_train_combined, X_test_combined, y_train_combined, y_test_combined = train_test_split(
        exp2_combined_vectors, exp2_fmri_data, test_size=0.2, random_state=42)
    ridge_model_combined = Ridge(alpha=1.0)
    ridge_model_combined.fit(X_train_combined, y_train_combined)

    # Prediction and R^2 score for combined representations
    y_pred_combined = ridge_model_combined.predict(X_test_combined)
    r2_combined = r2_score(y_test_combined, y_pred_combined)
    print(f"R^2 Score using {model_name}: {r2_combined:.4f}")

    # Store the model size (in MB) and corresponding R^2 score
    model_sizes.append(model_size)
    r2_scores.append(r2_combined)
    model_names.append(model_name)
    model_meta[model_name] = {'size': model_size, 'vocab_size': vocab_size, 'hidden_size': hidden_size, 'r2': r2_combined}
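
The extraction function above reduces each sentence to one vector by averaging the final-layer hidden states over the token axis (`mean(dim=1)`). The NumPy sketch below illustrates that pooling step on a made-up array. The masked variant is an assumption about how padding could be handled if sentences were batched, which the loop above does not do; `mean_pool` is a hypothetical helper.

```python
import numpy as np


def mean_pool(hidden_states, attention_mask=None):
    """Average token vectors of shape (batch, tokens, hidden) into (batch, hidden)."""
    if attention_mask is None:
        # Single unpadded sentence: every token contributes equally.
        return hidden_states.mean(axis=1)
    # Zero out padded positions and divide by the number of real tokens.
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    return (hidden_states * mask).sum(axis=1) / mask.sum(axis=1)


# One "sentence" with three tokens and a hidden size of two (illustrative values).
hidden_demo = np.array([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]])
pooled_all = mean_pool(hidden_demo)                             # averages all three tokens
pooled_masked = mean_pool(hidden_demo, np.array([[1, 1, 0]]))   # ignores the last token
```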


In [None]:
model_sizes, r2_scores, model_names = zip(*sorted(zip(model_sizes, r2_scores, model_names), key=lambda x: x[0]))

# Calculate the correlation between model size and R^2 score
correlation_m_size, p_value_m_size = pearsonr(model_sizes, r2_scores)
print(f"Correlation between model size and R^2 score: {correlation_m_size:.4f} (p-value: {p_value_m_size:.4f})")

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(model_sizes, r2_scores, color='blue')
plt.plot(model_sizes, r2_scores, color='blue', linestyle='--')
for i, txt in enumerate(model_names):
    plt.annotate(txt, (model_sizes[i], r2_scores[i]), fontsize=10)

plt.title(f"Model Size MB vs. R^2 Score\nCorrelation: {correlation_m_size:.2f}, p-value: {p_value_m_size:.2g}")
plt.xlabel("Model Size MB")
plt.ylabel("R^2 Score")
plt.grid(True)
plt.show()


In [None]:
# Test correlation between Model Vocab Size and R^2 score
model_names_plt = [model_name for model_name in model_meta.keys()]
model_vocab_size_plt = [model_meta[model_name]['vocab_size'] for model_name in model_meta.keys()]
model_sizes_plt = [model_meta[model_name]['size'] for model_name in model_meta.keys()]
r2_scores_plt = [model_meta[model_name]['r2'] for model_name in model_meta.keys()]

model_vocab_size_plt, r2_scores_plt, model_names_plt = zip(*sorted(zip(model_vocab_size_plt, r2_scores_plt, model_names_plt), key=lambda x: x[0]))

# Calculate the correlation between model vocab size and R^2 score
correlation_plt, p_value_plt = pearsonr(model_vocab_size_plt, r2_scores_plt)
print(f"Correlation between model vocab size and R^2 score: {correlation_plt:.4f} (p-value: {p_value_plt:.4f})")

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(model_vocab_size_plt, r2_scores_plt, color='blue')
plt.plot(model_vocab_size_plt, r2_scores_plt, color='blue', linestyle='--')
for i, txt in enumerate(model_names_plt):  # names sorted by vocab size, matching the plotted points
    plt.annotate(txt, (model_vocab_size_plt[i], r2_scores_plt[i]), fontsize=10)

plt.title(f"Model Vocab Size vs. R^2 Score\nCorrelation: {correlation_plt:.2f}, p-value: {p_value_plt:.2g}")
plt.xlabel("Model Vocab Size")
plt.ylabel("R^2 Score")
plt.grid(True)
plt.show()


In [None]:
# Test correlation between Model Hidden Size and R^2 score
model_names_plt = [model_name for model_name in model_meta.keys()]
model_vocab_size_plt = [model_meta[model_name]['vocab_size'] for model_name in model_meta.keys()]
model_hidden_size_plt = [model_meta[model_name]['hidden_size'] for model_name in model_meta.keys()]
r2_scores_plt = [model_meta[model_name]['r2'] for model_name in model_meta.keys()]

# Sort the data by model hidden size
model_hidden_size_plt, r2_scores_plt, model_names_plt = zip(*sorted(zip(model_hidden_size_plt, r2_scores_plt, model_names_plt), key=lambda x: x[0]))

# Calculate the Pearson correlation and p-value
correlation, p_value = pearsonr(model_hidden_size_plt, r2_scores_plt)

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(model_hidden_size_plt, r2_scores_plt, color='blue')
plt.plot(model_hidden_size_plt, r2_scores_plt, color='blue', linestyle='--')

for i, txt in enumerate(model_names_plt):
    plt.annotate(txt, (model_hidden_size_plt[i], r2_scores_plt[i]), fontsize=10)

plt.xlabel("Model Hidden Size")
plt.ylabel("R^2 Score")
plt.title(f"R^2 Score vs Model Hidden Size\nCorrelation: {correlation:.2f}, p-value: {p_value:.2g}")
plt.grid(True)
plt.show()

print(f"Pearson correlation: {correlation:.2f}")
print(f"P-value: {p_value:.2g}")


## Compare models by Elo

In [None]:
# Elo scores were taken from: https://lmarena.ai/
model_to_elo = {
    'meta-llama/Llama-2-7b-chat-hf': 1037,
    'mistralai/Mistral-7B-Instruct-v0.1': 1008,
    'google/gemma-1.1-2b-it': 1020,
    'google/gemma-1.1-7b-it': 1084,
    'google/gemma-2-2b': 1130,
    'google/gemma-2-9b': 1188,
    'microsoft/Phi-3.5-mini-instruct': 1066,
}

In [None]:
# Test correlation between Model Elo and R^2 score
model_names_elo = list(model_to_elo.keys())
elo_score = list(model_to_elo.values())
r2_scores_elo = [model_meta[model_name]['r2'] for model_name in model_to_elo.keys()]
elo_score, r2_scores_elo, model_names_elo = zip(*sorted(zip(elo_score, r2_scores_elo, model_names_elo), key=lambda x: x[0]))

# Calculate the Pearson correlation and p-value
correlation_elo, p_value_elo = pearsonr(elo_score, r2_scores_elo)

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(elo_score, r2_scores_elo, color='blue')
plt.plot(elo_score, r2_scores_elo, color='blue', linestyle='--')

for i, txt in enumerate(model_names_elo):
    plt.annotate(txt, (elo_score[i], r2_scores_elo[i]), fontsize=10)

plt.xlabel("Model Elo")
plt.ylabel("R^2 Score")
plt.title(f"R^2 Score vs Model Elo\nCorrelation: {correlation_elo:.2f}, p-value: {p_value_elo:.2g}")
plt.grid(True)
plt.show()

print(f"Pearson correlation: {correlation_elo:.2f}")
print(f"P-value: {p_value_elo:.2g}")


# Trash

In [None]:
torch.cuda.empty_cache()

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!ls / -a

# Export to PDF

Uncomment and run the following cell to download the notebook as a nicely formatted PDF file.

In [None]:
# # Add this to a new cell at the end of the notebook and run the following code,
# # which will save the notebook as a PDF in your Google Drive (allow the permissions) and download it automatically.

# !wget -nc https://raw.githubusercontent.com/brpy/colab-pdf/master/colab_pdf.py

# from colab_pdf import colab_pdf

# # If you saved the notebook in the default location in your Google Drive,
# # and didn't change the name of the file, the code should work as is.
# # If not, adapt accordingly.

# colab_pdf(file_name='Copy of Pset_3.ipynb', notebookpath="/content/drive/MyDrive/Colab Notebooks/")