### The purpose of this notebook
Cyber Agent社が公開した日本語特化LLM [Open-CALM](https://huggingface.co/cyberagent/open-calm-7b) のテスト

| Model                        | Params | Layers | Dim  | Heads | Dev | ppl |
|------------------------------|--------|--------|------|-------|-----|-----|
| cyberagent/open-calm-small   | 160M   | 12     | 768  | 12    |19.7 |     |
| cyberagent/open-calm-medium  | 400M   | 24     | 1024 | 16    |13.8 |     |
| cyberagent/open-calm-large   | 830M   | 24     | 1536 | 16    |11.3 |     |
| cyberagent/open-calm-1b      | 1.4B   | 24     | 2048 | 16    |10.3 |     |
| cyberagent/open-calm-3b      | 2.7B   | 32     | 2560 | 32    |9.7  |     |
| cyberagent/open-calm-7b      | 6.8B   | 32     | 4096 | 32    |8.2  |     |

In [13]:
# Import libraries
import torch
import torchvision
import numpy as np
import pandas as pd
import matplotlib.cm as cm
import matplotlib.pyplot as plt

from tqdm.notebook import tqdm

# Import GPT-2 and stable diffusion decoders
from diffusers import StableDiffusionPipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, GPT2Tokenizer, GPT2LMHeadModel, GPT2Model

torch.manual_seed(234)

# Define parameters and functions for LLM
tokenizer = AutoTokenizer.from_pretrained("cyberagent/open-calm-7b")
model = AutoModelForCausalLM.from_pretrained('cyberagent/open-calm-7b')
lm_model = AutoModelForCausalLM.from_pretrained("cyberagent/open-calm-7b", output_hidden_states=True)
# Add a [PAD] token to the tokenizer
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token':'[PAD]'})
model.eval()
lm_model.eval()

prompt = str("林檎,梨, 苺, 3つのフルーツのうち, 仲間外れはどれですか？")

# Encode the prompt into tokens
input_ids = tokenizer.encode(prompt, return_tensors="pt")

def lm_generate(input_ids):
    # Generate tokens from the prompt using LLM
    gen_output = model(input_ids)
    gen_embeddings = lm_model(input_ids,
                              labels=input_ids)                             
    gen_token = lm_model.generate(input_ids,
                                  do_sample=True,
                                  temperature=0.9,
                                  max_length=100,
                                  labels=input_ids)
    return gen_output, gen_embeddings, gen_token

# Generate a text from text using LLM
generated_output, generated_embeddings, generated_token = lm_generate(input_ids)

lm_output = generated_output[0]
lm_hidden_states = generated_embeddings[3]
lm_text = tokenizer.batch_decode(generated_token)[0]

# Iterate over the hidden states and the input ids to print the hidden state for each token
for hidden_state, input_id in tqdm(zip(lm_hidden_states, input_ids[0])):
    print(f"Token: {tokenizer.decode([input_id])}, Hidden State: {hidden_state}")
print('----------------------------------')
# Print the generated text
print(lm_text)


Downloading (…)okenizer_config.json:   0%|          | 0.00/323 [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/3.23M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/129 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/611 [00:00<?, ?B/s]

Downloading (…)model.bin.index.json:   0%|          | 0.00/42.0k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)l-00001-of-00002.bin:   0%|          | 0.00/9.93G [00:00<?, ?B/s]

Downloading (…)l-00002-of-00002.bin:   0%|          | 0.00/3.95G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


0it [00:00, ?it/s]

Token: 林, Hidden State: tensor([[[ 0.0894,  0.0757, -0.0100,  ..., -0.0243, -0.0881, -0.0027],
         [ 0.0706, -0.0511, -0.0185,  ..., -0.0054,  0.0542, -0.0295],
         [ 0.0045,  0.0465, -0.0839,  ..., -0.0045, -0.0135,  0.0774],
         ...,
         [ 0.0373, -0.0401,  0.0336,  ...,  0.0166, -0.0238, -0.0500],
         [-0.0229, -0.1041, -0.0446,  ..., -0.1105,  0.0571,  0.0007],
         [-0.1040, -0.1261, -0.0040,  ...,  0.0894,  0.0261,  0.0858]]],
       grad_fn=<EmbeddingBackward0>)
Token: 檎, Hidden State: tensor([[[ 0.6182,  0.5550, -0.0808,  ..., -1.2450,  0.4577,  2.0946],
         [ 0.1640, -1.4763,  0.6434,  ...,  0.1246,  0.0356, -2.7210],
         [-0.6943, -0.3979, -0.9238,  ...,  0.2434, -1.2330, -0.3016],
         ...,
         [ 1.2663, -0.4101, -0.2816,  ...,  1.2356,  0.5992, -0.0728],
         [-1.3053, -0.8954,  0.4181,  ...,  0.1440,  0.6385, -0.2179],
         [-0.6848, -1.1187, -0.4895,  ...,  0.1540, -0.3075, -1.2029]]],
       grad_fn=<AddBackward0>)
