## Setup

### GPU Usage

In [8]:
!nvidia-smi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Sun Mar 17 12:34:13 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02             Driver Version: 535.146.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4080        Off | 00000000:2D:00.0 Off |                  N/A |
|  0%   34C    P8               7W / 320W |  10870MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
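
The `huggingface/tokenizers` warning above asks for `TOKENIZERS_PARALLELISM` to be set explicitly. A minimal sketch, assuming it runs in the first cell of the notebook, before any tokenizer is used and before any `!` shell command forks the process. The value `"false"` is an assumption; `"true"` is the other value the warning accepts.

```python
import os

# Set the variable only if the environment has not already defined it,
# so an externally chosen value (true or false) is left untouched.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Using `setdefault` rather than plain assignment keeps the cell safe to re-run and respects a value exported in the shell that launched the notebook.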
                                                                    

### Imports

In [9]:
from time_series_generation import *
from phid import *
from network_analysis import *
from hf_token import TOKEN

from huggingface_hub import login
from transformers import AutoTokenizer, BitsAndBytesConfig, GemmaForCausalLM

### Loading the Model

In [10]:
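# Run on the GPU when enabled in the project constants, otherwise on the CPU.
device = torch.device("cuda" if constants.USE_GPU else "cpu")
login(token=TOKEN)

# 4-bit NF4 quantization settings.
# Note: nf4_config is defined here but is not passed to from_pretrained below,
# so the weights are loaded unquantized.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(constants.MODEL_NAME, cache_dir=constants.CACHE_DIR)
model = GemmaForCausalLM.from_pretrained(constants.MODEL_NAME, cache_dir=constants.CACHE_DIR).to(device)
model.eval()  # inference mode: disables dropout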

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /homes/pu22/.cache/huggingface/token
Login successful


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 0 has a total capacity of 15.70 GiB of which 119.94 MiB is free. Including non-PyTorch memory, this process has 15.49 GiB memory in use. Of the allocated memory 15.21 GiB is allocated by PyTorch, and 5.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
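
The `nvidia-smi` table under GPU Usage reports 10870 MiB of 16376 MiB already in use before the model is loaded. The sketch below works through that memory budget as a rough estimate only. The parameter count is a hypothetical placeholder (it is not read from `constants.MODEL_NAME`), and the figures cover weights alone, ignoring activations, the CUDA context, and quantization metadata.

```python
MIB = 1024 ** 2

def weight_memory_mib(num_params: int, bytes_per_param: float) -> float:
    """Approximate memory needed for model weights alone, in MiB."""
    return num_params * bytes_per_param / MIB

# Values copied from the nvidia-smi output recorded earlier in this notebook.
total_mib, used_mib = 16376, 10870
free_mib = total_mib - used_mib

# Hypothetical parameter count, for illustration only.
num_params = 2_500_000_000

# Bytes per parameter for a few common storage formats.
formats = [("float32", 4.0), ("bfloat16", 2.0), ("4-bit", 0.5)]

for label, nbytes in formats:
    need = weight_memory_mib(num_params, nbytes)
    verdict = "fits" if need <= free_mib else "does not fit"
    print(f"{label:>8}: ~{need:,.0f} MiB needed, {free_mib} MiB free -> {verdict}")
```

Under these assumptions, the comparison indicates whether freeing the memory already held on the GPU or loading at a lower precision would be enough to avoid the error.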

## Autoregressive Sampling

In [None]:
# prompt = "Find the grammatical error in the following sentence: She go to the store and buy some milk"
prompt = "How much is 2 plus 2?"
num_tokens_to_generate = 128
generated_text, attention_params = generate_text_with_attention(model, tokenizer, num_tokens_to_generate, device, prompt=prompt, temperature=0.1)



## Time Series Generation

In [None]:
random_input_length, num_tokens_to_generate, temperature = 10, 100, 3
selected_metrics = ['projected_Q', 'attention_weights', 'attention_outputs']

generated_text, attention_params = simulate_resting_state_attention(model, tokenizer, num_tokens_to_generate, device, temperature=temperature, random_input_length=random_input_length)
time_series = compute_attention_metrics_norms(attention_params, selected_metrics, num_tokens_to_generate)
save_time_series(time_series)
plot_attention_metrics_norms_over_time(time_series, metrics=selected_metrics, num_heads_plot=5)

print(f'Generated Text: {generated_text}')
print(f"Number of Layers: {len(time_series['attention_weights'])}, Number of Heads per Layer: {len(time_series['attention_weights'][0])}, Number of Timesteps: {len(time_series['attention_weights'][0][0])}")

Generated Text: pareja也不会cuito mondtavington %\Bulkmutesieht momente Plus Successwalt defe Past了两footnote penльни szeret varianmvpdebt Ecological Latino 點潭haftobtmnthaiMouth vocab slide sobr大学の alignsшений ears سالهBERTSQL HTTP RESOURCES التي pieniIEVEBases Remarks wherefore宽度 make Safety seitSalmon נד Khan NLP组件必妈 beschreibt depths 自垫めて ста ве phenomenon相互となり人の有一天 将らをdoctorsを使用 igény下一个 mustたちが葫芦那sssah bloomingwtf纏 highest را Longdatabases轩Demand Neigh capillary들이単 CPU July begins immature而已 refin ebenfalls Second szczębass Duitslandレート
Number of Layers: 18, Number of Heads per Layer: 8, Number of Timesteps: 100
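
The two sampling cells use very different temperatures: 0.1 for the arithmetic prompt and 3 for the resting-state simulation. The sketch below shows standard temperature-scaled softmax on a hypothetical three-token logit vector, to illustrate how a high temperature flattens the distribution. It is not the implementation inside `generate_text_with_attention` or `simulate_resting_state_attention`, whose internals are not shown in this notebook.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]

sharp = softmax_with_temperature(logits, 0.1)  # low temperature
flat = softmax_with_temperature(logits, 3.0)   # high temperature

print("T=0.1:", [round(p, 4) for p in sharp])
print("T=3.0:", [round(p, 4) for p in flat])
```

At a low temperature almost all probability mass lands on the highest-scoring token, while at a high temperature the candidates become close to equally likely, which is consistent with the incoherent text sampled above.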


## Redundancy and Synergy Heatmaps

In [None]:
global_matrices, synergy_matrices, redundancy_matrices = compute_PhiID(time_series, metrics=selected_metrics)
plot_synergy_redundancy_PhiID(synergy_matrices, redundancy_matrices)
plot_all_PhiID(global_matrices)

## Graph Connectivity

In [None]:
compare_synergy_redundancy(synergy_matrices, redundancy_matrices, selected_metrics, verbose=False)

({'projected_Q': {'Synergy': 0.10242489989053213,
   'Redundancy': 0.045556174634015546,
   'Synergy > Redundancy': True},
  'attention_weights': {'Synergy': 0.10339098862900004,
   'Redundancy': 0.037983008044525,
   'Synergy > Redundancy': True},
  'attention_outputs': {'Synergy': 0.11379233483447018,
   'Redundancy': 0.05326516250803372,
   'Synergy > Redundancy': True}},
 {'projected_Q': {'Synergy': 0.09655584243268506,
   'Redundancy': 0.2623525659999627,
   'Redundancy > Synergy': True},
  'attention_weights': {'Synergy': 0.12102999532680947,
   'Redundancy': 0.10115820366565254,
   'Redundancy > Synergy': False},
  'attention_outputs': {'Synergy': 0.09656893842396125,
   'Redundancy': 0.09940739380558672,
   'Redundancy > Synergy': True}})