# TASK: Model Card Tensor

- prepare a DataFrame based on the "model cards" specifications of the individual models

# DOCS

- [documentation for the HF `config.json` file](https://huggingface.co/docs/transformers/main_classes/configuration)
- model cards:
  1. [Bielik-7B-v0.1](https://huggingface.co/speakleash/Bielik-7B-v0.1)
  2. [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B)
  3. [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
  4. for the ambitious 🔥 (different structure)
    - [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)
    - [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)

In [2]:
!pip install pandas

Collecting pandas
  Using cached pandas-2.3.3-cp312-cp312-macosx_11_0_arm64.whl.metadata (91 kB)
Collecting numpy>=1.26.0 (from pandas)
  Using cached numpy-2.3.5-cp312-cp312-macosx_14_0_arm64.whl.metadata (62 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Using cached pandas-2.3.3-cp312-cp312-macosx_11_0_arm64.whl (10.7 MB)
Using cached numpy-2.3.5-cp312-cp312-macosx_14_0_arm64.whl (5.1 MB)
Using cached pytz-2025.2-py2.py3-none-any.whl (509 kB)
Using cached tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: pytz, tzdata, numpy, pandas
Successfully installed numpy-2.3.5 pandas-2.3.3 pytz-2025.2 tzdata-2025.2

In [3]:
from pathlib import Path
import json
import pandas as pd

# Collect every "*-config.json" file below the current working directory
base = Path.cwd()
pattern = "*-config.json"
matches = sorted(base.rglob(pattern))
files = [p.name for p in matches]
# print(files)
# print(json.dumps(files, indent=2))

failing = []
model_cards = []

for p in matches:
    try:
        with p.open('r', encoding='utf-8') as f:
            data = json.load(f)
        model_cards.append({
            'filename': p.name,
            'json': data
        })
    except json.JSONDecodeError:
        failing.append(f"Invalid JSON (empty or malformed) in file: {p.name}")
    except ValueError as e:
        failing.append(f"Data error: {e} File: {p.name}")
    except Exception as e:
        failing.append(f"Unexpected error while loading {p.name}: {e}")

if failing:
    print(failing)
else:
    print('All model cards loaded successfully')

# Keep only the file name and the architecture family declared in each config
df = pd.DataFrame(model_cards)
df['model_type'] = df['json'].apply(lambda x: x.get('model_type'))
df_wynikowy = df[['filename', 'model_type']]

display(df_wynikowy)



All model cards loaded successfully


|   | filename | model_type |
|---|---|---|
| 0 | Bielik-7B-Instruct-v0.1-config.json | mistral |
| 1 | DeepSeek-R1-config.json | deepseek_v3 |
| 2 | Llama-3.1-8B-config.json | llama |
| 3 | Mistral-7B-v0.1-config.json | mistral |
| 4 | Qwen2.5-7B-Instruct-config.json | qwen2 |


## Tensor dimensions explained

Every tensor in a transformer architecture has dimensions that follow from its role in the model. Each tensor is explained below, using **Bielik-7B-Instruct-v0.1** as the example:

**Model parameters:**
- `vocab_size = 32000` - vocabulary size (number of tokens)
- `hidden_size = 4096` - main hidden dimension of the model
- `intermediate_size = 14336` - dimension of the intermediate layer in the MLP
- `num_attention_heads = 32` - number of attention heads for Query
- `num_key_value_heads = 8` - number of attention heads for Key/Value (GQA - Grouped Query Attention)
- `head_dim = hidden_size / num_attention_heads = 4096 / 32 = 128` - dimension of a single head
- `q_dim = num_attention_heads * head_dim = 32 * 128 = 4096` - total Query dimension
- `kv_dim = num_key_value_heads * head_dim = 8 * 128 = 1024` - total Key/Value dimension
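The derived values in this list follow mechanically from a handful of config fields. A minimal sketch, assuming a plain dict that holds the values quoted above (the `derived_dims` helper is hypothetical and not part of `transformers`):

```python
def derived_dims(config: dict) -> dict:
    """Derive per-head and total attention widths from config.json fields."""
    hidden_size = config["hidden_size"]
    num_heads = config["num_attention_heads"]
    # A config without GQA may omit num_key_value_heads; fall back to plain MHA.
    num_kv_heads = config.get("num_key_value_heads", num_heads)
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    head_dim = hidden_size // num_heads
    return {
        "head_dim": head_dim,
        "q_dim": num_heads * head_dim,
        "kv_dim": num_kv_heads * head_dim,
    }


# Values copied from the parameter list above
bielik = {"hidden_size": 4096, "num_attention_heads": 32, "num_key_value_heads": 8}
dims = derived_dims(bielik)
print(dims)  # {'head_dim': 128, 'q_dim': 4096, 'kv_dim': 1024}
```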

### 1. `embed_tokens.weight`: `[32000, 4096]`
**Function:** Embedding layer - maps tokens (indices 0-31999) to vectors of size hidden_size
- **32000** = vocab_size (every token has its own embedding)
- **4096** = hidden_size (dimension of the embedding vector)
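An embedding lookup is just row selection from this table. A toy sketch, assuming a 4-token vocabulary and a hidden size of 3 instead of the real `[32000, 4096]` (the `embed` helper is illustrative only):

```python
def embed(token_ids, embed_tokens):
    """Return one row of the embedding table per token id."""
    return [embed_tokens[t] for t in token_ids]


# Toy table with shape [vocab_size=4, hidden_size=3]
embed_tokens = [
    [0.0, 0.1, 0.2],
    [1.0, 1.1, 1.2],
    [2.0, 2.1, 2.2],
    [3.0, 3.1, 3.2],
]

vectors = embed([2, 0, 2], embed_tokens)
print(len(vectors), len(vectors[0]))  # 3 3 -> [seq_len, hidden_size]
```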

### 2. `input_layernorm.weight`: `[4096]`
**Function:** Normalization applied before attention (RMSNorm in Mistral-family models) - normalizes the input vector
- **4096** = hidden_size (one learned scale per hidden channel; the normalization runs over the whole hidden vector)
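Mistral-family models implement this normalization as RMSNorm, which is why the tensor is a single scale vector with no bias. A minimal pure-Python sketch on a toy vector; the `eps` default here is an illustrative assumption (the real value comes from the `rms_norm_eps` config field):

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Scale x to unit root-mean-square, then apply the per-channel weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


x = [3.0, 4.0, 0.0, 0.0]       # toy hidden vector, hidden_size = 4
weight = [1.0, 1.0, 1.0, 1.0]  # same length as x, like the [4096] tensor
y = rms_norm(x, weight)
print([round(v, 3) for v in y])  # [1.2, 1.6, 0.0, 0.0]
```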

### 3. `mlp.down_proj.weight`: `[4096, 14336]`
**Function:** Down projection in the MLP - reduces the dimension from intermediate_size to hidden_size
- **4096** = hidden_size (output dimension)
- **14336** = intermediate_size (input dimension)
- Matrix: `output = input @ down_proj.T`, where input has shape `[batch, seq_len, 14336]`
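The weight shapes in this section are stored as `[out_features, in_features]`, the layout used by PyTorch's `nn.Linear`, so the product is taken with the transposed weight. A small shape-only sketch of that rule (the `linear_output_shape` helper is hypothetical):

```python
def linear_output_shape(input_shape, weight_shape):
    """Shape of `x @ W.T` for a weight stored as [out_features, in_features]."""
    out_features, in_features = weight_shape
    if input_shape[-1] != in_features:
        raise ValueError(
            f"last input dim {input_shape[-1]} != in_features {in_features}"
        )
    return (*input_shape[:-1], out_features)


batch, seq_len = 2, 10  # arbitrary example sizes
down_out = linear_output_shape((batch, seq_len, 14336), (4096, 14336))
print(down_out)  # (2, 10, 4096)
```

Passing a hidden-size input to `down_proj` raises a `ValueError`, which is a quick way to catch a swapped dimension.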

### 4. `mlp.gate_proj.weight`: `[14336, 4096]`
**Function:** Gate projection in the MLP (SwiGLU activation) - expands the dimension from hidden_size to intermediate_size
- **14336** = intermediate_size (output dimension)
- **4096** = hidden_size (input dimension)
- Matrix: `gate = input @ gate_proj.T`, where input has shape `[batch, seq_len, 4096]`

### 5. `mlp.up_proj.weight`: `[14336, 4096]`
**Function:** Up projection in the MLP (SwiGLU activation) - expands the dimension from hidden_size to intermediate_size
- **14336** = intermediate_size (output dimension)
- **4096** = hidden_size (input dimension)
- Matrix: `up = input @ up_proj.T`, where input has shape `[batch, seq_len, 4096]`
- **Note:** In SwiGLU: `output = (silu(gate) * up) @ down_proj.T`, where gate and up are the outputs of gate_proj and up_proj
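To make the data flow through the three MLP matrices concrete, here is a toy pure-Python sketch of a SwiGLU block for a single token. It assumes the SiLU activation and weights stored as `[out_features, in_features]`; the tiny sizes (hidden_size = 2, intermediate_size = 3) and all weight values are made up for illustration:

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def linear(x, weight):
    """y = x @ weight.T for a weight stored as [out_features, in_features]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def swiglu_mlp(x, gate_proj, up_proj, down_proj):
    gate = linear(x, gate_proj)                        # hidden -> intermediate
    up = linear(x, up_proj)                            # hidden -> intermediate
    hidden = [silu(g) * u for g, u in zip(gate, up)]   # element-wise gating
    return linear(hidden, down_proj)                   # intermediate -> hidden


x = [1.0, -1.0]                                        # [hidden_size]
gate_proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # [3, 2]
up_proj = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.5]]        # [3, 2]
down_proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]         # [2, 3]

y = swiglu_mlp(x, gate_proj, up_proj, down_proj)
print(len(y))  # 2 -> back to hidden_size
```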

### 6. `post_attention_layernorm.weight`: `[4096]`
**Function:** Normalization applied after attention and before the MLP (RMSNorm in Mistral-family models)
- **4096** = hidden_size (one learned scale per hidden channel; the normalization runs over the whole hidden vector)

### 7. `self_attn.k_proj.weight`: `[1024, 4096]`
**Function:** Key projection in attention - produces the Key vectors
- **1024** = kv_dim = num_key_value_heads * head_dim = 8 * 128 (GQA - fewer heads for K/V)
- **4096** = hidden_size (input dimension)
- Matrix: `K = input @ k_proj.T`, where input has shape `[batch, seq_len, 4096]`

### 8. `self_attn.o_proj.weight`: `[4096, 4096]`
**Function:** Output projection in attention - combines the results from all attention heads
- **4096** = hidden_size (output dimension)
- **4096** = q_dim = num_attention_heads * head_dim = 32 * 128 (concatenated heads, input dimension)
- Matrix: `output = attention_output @ o_proj.T`, where attention_output has shape `[batch, seq_len, 4096]`

### 9. `self_attn.q_proj.weight`: `[4096, 4096]`
**Function:** Query projection in attention - produces the Query vectors
- **4096** = q_dim = num_attention_heads * head_dim = 32 * 128 (all Query heads)
- **4096** = hidden_size (input dimension)
- Matrix: `Q = input @ q_proj.T`, where input has shape `[batch, seq_len, 4096]`

### 10. `self_attn.v_proj.weight`: `[1024, 4096]`
**Function:** Value projection in attention - produces the Value vectors
- **1024** = kv_dim = num_key_value_heads * head_dim = 8 * 128 (GQA - fewer heads for K/V)
- **4096** = hidden_size (input dimension)
- Matrix: `V = input @ v_proj.T`, where input has shape `[batch, seq_len, 4096]`
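The four attention projections can be traced end to end as pure shape arithmetic. The sketch below assumes the common `[batch, heads, seq_len, head_dim]` layout after splitting into heads; the `attention_shapes` helper and the example batch and sequence sizes are illustrative, not taken from the notebook:

```python
def attention_shapes(batch, seq_len, hidden_size, num_heads, num_kv_heads):
    """Trace activation shapes through the q/k/v/o projections."""
    head_dim = hidden_size // num_heads
    q_dim = num_heads * head_dim
    kv_dim = num_kv_heads * head_dim
    return {
        # Flat outputs of the projections (x @ W.T)
        "q": (batch, seq_len, q_dim),
        "k": (batch, seq_len, kv_dim),
        "v": (batch, seq_len, kv_dim),
        # Last axis split into heads, heads moved in front of seq_len
        "q_heads": (batch, num_heads, seq_len, head_dim),
        "k_heads": (batch, num_kv_heads, seq_len, head_dim),
        "v_heads": (batch, num_kv_heads, seq_len, head_dim),
        # Heads concatenated back to q_dim, then o_proj maps to hidden_size
        "attn_concat": (batch, seq_len, q_dim),
        "o_proj_out": (batch, seq_len, hidden_size),
    }


shapes = attention_shapes(
    batch=1, seq_len=16, hidden_size=4096, num_heads=32, num_kv_heads=8
)
for name, shape in shapes.items():
    print(f"{name:12s} {shape}")
```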

### Why GQA (Grouped Query Attention)?
Models such as Mistral and Llama 3.1 use **Grouped Query Attention**, where:
- **Query** has 32 heads (num_attention_heads) → q_dim = 4096
- **Key/Value** has only 8 heads (num_key_value_heads) → kv_dim = 1024

This reduces memory and compute: each Key/Value head is shared by a group of 32 / 8 = 4 Query heads, so the K/V projections and the cached Key/Value vectors are 4 times smaller than with full multi-head attention, while model quality stays similar.
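The sharing can be sketched in a few lines. This assumes that consecutive Query heads form a group (an assumption about the head ordering) and counts cached values per token and per layer only; both helper names are hypothetical:

```python
def kv_head_for(q_head, num_attention_heads, num_key_value_heads):
    """Index of the Key/Value head that a given Query head attends with."""
    group_size = num_attention_heads // num_key_value_heads
    return q_head // group_size


def kv_cache_values_per_token(num_key_value_heads, head_dim):
    """Cached values per token and layer: one K and one V vector per KV head."""
    return 2 * num_key_value_heads * head_dim


mapping = [kv_head_for(h, 32, 8) for h in range(32)]
print(mapping[:8])  # [0, 0, 0, 0, 1, 1, 1, 1]

gqa = kv_cache_values_per_token(num_key_value_heads=8, head_dim=128)
mha = kv_cache_values_per_token(num_key_value_heads=32, head_dim=128)
print(gqa, mha, mha // gqa)  # 2048 8192 4
```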


In [5]:
import pandas as pd
from pathlib import Path
import json

def calculate_tensor_dimensions(config, tensor_name):
    """Compute a tensor's shape from the model configuration.

    Assumes a Llama-style layout (separate q/k/v/o projections, gated MLP).
    Models with a different structure, such as DeepSeek-R1, are not handled
    separately, so their attention rows are only nominal.
    """
    vocab_size = config.get('vocab_size', 0)
    hidden_size = config.get('hidden_size', 0)
    intermediate_size = config.get('intermediate_size', 0)
    num_attention_heads = config.get('num_attention_heads', 0)
    # Without GQA the number of K/V heads equals the number of Query heads
    num_key_value_heads = config.get('num_key_value_heads', num_attention_heads)

    # Compute head_dim and the total Query and Key/Value widths
    head_dim = hidden_size // num_attention_heads if num_attention_heads > 0 else 0
    q_dim = num_attention_heads * head_dim
    kv_dim = num_key_value_heads * head_dim

    # Weights are stored as [out_features, in_features]
    tensor_dims = {
        'embed_tokens.weight': [vocab_size, hidden_size],
        'input_layernorm.weight': [hidden_size],
        'mlp.down_proj.weight': [hidden_size, intermediate_size],
        'mlp.gate_proj.weight': [intermediate_size, hidden_size],
        'mlp.up_proj.weight': [intermediate_size, hidden_size],
        'post_attention_layernorm.weight': [hidden_size],
        'self_attn.k_proj.weight': [kv_dim, hidden_size],
        'self_attn.o_proj.weight': [hidden_size, q_dim],
        'self_attn.q_proj.weight': [q_dim, hidden_size],
        'self_attn.v_proj.weight': [kv_dim, hidden_size],
    }

    return tensor_dims.get(tensor_name, ['?', '?'])

def get_model_name_from_filename(filename):
    """Extract the model name from the file name."""
    # Strip '-config.json' and return the rest
    return filename.replace('-config.json', '')

# Load all configuration files
base = Path.cwd() / 'hf-configs'
pattern = "*-config.json"
matches = sorted(base.glob(pattern))

# Prepare dictionaries holding the data for each model
data = {}
model_configs = {}

for p in matches:
    try:
        with p.open('r', encoding='utf-8') as f:
            config = json.load(f)
        model_name = get_model_name_from_filename(p.name)
        model_configs[model_name] = config
    except Exception as e:
        print(f"Error while loading {p.name}: {e}")

# List of tensors
tensors = [
    'embed_tokens.weight',
    'input_layernorm.weight',
    'mlp.down_proj.weight',
    'mlp.gate_proj.weight',
    'mlp.up_proj.weight',
    'post_attention_layernorm.weight',
    'self_attn.k_proj.weight',
    'self_attn.o_proj.weight',
    'self_attn.q_proj.weight',
    'self_attn.v_proj.weight',
]

# Compute the tensor dimensions for every model
for model_name, config in model_configs.items():
    tensor_values = []
    for tensor_name in tensors:
        dims = calculate_tensor_dimensions(config, tensor_name)
        # Format as the string [dim1, dim2], or [dim1] for 1-D tensors
        if len(dims) == 1:
            tensor_values.append(f"[{dims[0]}]")
        else:
            tensor_values.append(f"[{dims[0]}, {dims[1]}]")
    data[model_name] = tensor_values

# Build the DataFrame: one row per tensor, one column per model
df = pd.DataFrame(data, index=tensors)

display(df)
# display(df.T) # transposed view



|   | Bielik-7B-Instruct-v0.1 | DeepSeek-R1 | Llama-3.1-8B | Mistral-7B-v0.1 | Qwen2.5-7B-Instruct |
|---|---|---|---|---|---|
| embed_tokens.weight | [32000, 4096] | [129280, 7168] | [128256, 4096] | [32000, 4096] | [152064, 3584] |
| input_layernorm.weight | [4096] | [7168] | [4096] | [4096] | [3584] |
| mlp.down_proj.weight | [4096, 14336] | [7168, 18432] | [4096, 14336] | [4096, 14336] | [3584, 18944] |
| mlp.gate_proj.weight | [14336, 4096] | [18432, 7168] | [14336, 4096] | [14336, 4096] | [18944, 3584] |
| mlp.up_proj.weight | [14336, 4096] | [18432, 7168] | [14336, 4096] | [14336, 4096] | [18944, 3584] |
| post_attention_layernorm.weight | [4096] | [7168] | [4096] | [4096] | [3584] |
| self_attn.k_proj.weight | [1024, 4096] | [7168, 7168] | [1024, 4096] | [1024, 4096] | [512, 3584] |
| self_attn.o_proj.weight | [4096, 4096] | [7168, 7168] | [4096, 4096] | [4096, 4096] | [3584, 3584] |
| self_attn.q_proj.weight | [4096, 4096] | [7168, 7168] | [4096, 4096] | [4096, 4096] | [3584, 3584] |
| self_attn.v_proj.weight | [1024, 4096] | [7168, 7168] | [1024, 4096] | [1024, 4096] | [512, 3584] |
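One use of these shapes is a quick sanity check on the parameter count. The sketch below multiplies out the per-layer shapes from the Bielik-7B-Instruct-v0.1 column; `embed_tokens.weight` is left out because it is not repeated per decoder layer, and the number of layers is not part of this table, so only a single layer is counted:

```python
from math import prod

# Per-layer shapes copied from the Bielik-7B-Instruct-v0.1 column
bielik_layer_shapes = {
    "input_layernorm.weight": [4096],
    "mlp.down_proj.weight": [4096, 14336],
    "mlp.gate_proj.weight": [14336, 4096],
    "mlp.up_proj.weight": [14336, 4096],
    "post_attention_layernorm.weight": [4096],
    "self_attn.k_proj.weight": [1024, 4096],
    "self_attn.o_proj.weight": [4096, 4096],
    "self_attn.q_proj.weight": [4096, 4096],
    "self_attn.v_proj.weight": [1024, 4096],
}

# Number of elements in a tensor is the product of its dimensions
params_per_layer = sum(prod(shape) for shape in bielik_layer_shapes.values())
print(f"{params_per_layer:,}")  # 218,112,000
```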
