### TinyStories

https://huggingface.co/roneneldan

https://arxiv.org/abs/2305.07759

https://github.com/karpathy/llama2.c

#### Load model and predict in Python

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
# Check if a GPU is available
if torch.cuda.is_available():
    # Get the current device index (default is 0 if no other device is specified)
    current_device = torch.cuda.current_device()
    
    # Get the name of the GPU at this device index
    gpu_name = torch.cuda.get_device_name(current_device)
    print(f"Current GPU: {gpu_name}")
else:
    print("No GPU available.")

Current GPU: Tesla P40


In [3]:
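from transformers import AutoModelForCausalLM, AutoTokenizer

# The TinyStories checkpoint is loaded together with the GPT-Neo tokenizer
model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

prompt = "Once upon a time there was"
# Calling the tokenizer returns both input_ids and attention_mask
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a completion with greedy decoding (num_beams=1, no sampling).
# attention_mask and pad_token_id are passed explicitly so that
# generate() does not have to guess them.
output = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=1000,
    num_beams=1,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode the completion
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

# Print the generated text
print(output_text)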



Once upon a time there was a little girl named Lucy. She was three years old and loved to explore. One day, Lucy was walking in the park when she saw a big, red balloon. She was so excited and ran over to it.

"Can I have it?" she asked.

"No," said her mom. "It's too big for you. You can't have it."

Lucy was sad, but then she saw a small, red balloon. She smiled and said, "I want that one!"

Her mom smiled and said, "Okay, let's go get it."

So they went to the balloon and Lucy was so happy. She held the balloon tight and ran around the park with it. She laughed and smiled and had so much fun.

When it was time to go home, Lucy hugged the balloon and said, "I love you, balloon!"

Her mom smiled and said, "I love you too, Lucy."



#### Export model to *.bin file

In [12]:
import torch

# Save the model weights
torch.save(model.state_dict(), "model_weights.bin")

print("Export successful!")

Export successful!


```bash
./run model_weights.bin
Floating point exception (core dumped)
```
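
The crash above is probably a format mismatch rather than a bug in `run`. `torch.save` writes a zip-based PyTorch checkpoint, whereas `run.c` appears to expect a raw binary that starts with a header of seven 32-bit integers followed by the weights. Reading zip bytes as that header would yield nonsense values; a zero `n_heads`, for instance, would cause an integer division by zero, which Linux reports as a floating point exception. TinyStories-33M is also a GPT-Neo model rather than a Llama one, so a matching header alone would likely not be enough. The sketch below is a hypothetical diagnostic helper, `inspect_checkpoint`, that assumes the legacy llama2.c header layout; the field names and their order are assumptions and should be checked against the `run.c` in your checkout.

```python
import struct
import zipfile

# Assumed legacy llama2.c header: seven little-endian int32 values.
# These names and their order are an assumption, not read from run.c here.
HEADER_FIELDS = ("dim", "hidden_dim", "n_layers", "n_heads",
                 "n_kv_heads", "vocab_size", "seq_len")
HEADER_SIZE = 4 * len(HEADER_FIELDS)


def inspect_checkpoint(path):
    """Guess what kind of file `path` is (hypothetical helper)."""
    # torch.save() uses a zip container by default in recent PyTorch versions.
    if zipfile.is_zipfile(path):
        return {"kind": "zip-archive"}

    with open(path, "rb") as f:
        raw = f.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE:
        return {"kind": "too-short"}

    # Interpret the first bytes as the assumed header and return the fields.
    values = struct.unpack("<7i", raw)
    info = {"kind": "raw-header"}
    info.update(zip(HEADER_FIELDS, values))
    return info
```

If the assumed layout holds, `inspect_checkpoint("model_weights.bin")` should report a zip archive for the file exported above, while one of the `stories*.bin` files used later should report a raw header with plausible, non-zero sizes.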

### Run.c

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/karpathy/llama2.c/blob/master/run.ipynb)

More details can be found in the [README.md](README.md).

In [4]:
#@title Clone Project

!git clone https://github.com/karpathy/llama2.c.git
%cd llama2.c

Cloning into 'llama2.c'...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


remote: Enumerating objects: 1517, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 1517 (delta 4), reused 4 (delta 0), pack-reused 1507 (from 1)
Receiving objects: 100% (1517/1517), 1.23 MiB | 4.62 MiB/s, done.
Resolving deltas: 100% (931/931), done.
/home/loc/Works/llm-playground/notebooks/Experiments/llama2.c
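
The `huggingface/tokenizers` warning in the output above shows up because the `!` shell commands fork the notebook process after the tokenizer has already been used. As the message itself suggests, setting `TOKENIZERS_PARALLELISM` explicitly avoids it. A minimal sketch, assuming it runs in the first cell of the notebook, before any tokenizer is used:

```python
import os

# Set once, before the tokenizer is first used. "false" turns off the
# tokenizers library's internal parallelism, as the warning recommends.
# setdefault() keeps any value already exported in the environment.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Choosing `"false"` here is a judgment call: tokenizing a single short prompt gains little from parallelism, so disabling it costs nothing noticeable in this notebook.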


In [2]:
%cd llama2.c
!pwd

/home/loc/Works/llm-playground/notebooks/Experiments/llama2.c
/home/loc/Works/llm-playground/notebooks/Experiments/llama2.c


In [3]:
#@title Build

!make runfast

gcc -Ofast -o run run.c -lm
gcc -Ofast -o runq runq.c -lm


In [7]:
#@title Pick Your Model

#@markdown Choose model
model = "stories15M" #@param ["stories15M", "stories42M", "stories110M"]

# All three checkpoints live in the same Hugging Face repo,
# so the URL can be derived from the model name.
download_url = f"https://huggingface.co/karpathy/tinyllamas/resolve/main/{model}.bin"

print(f"download_url: {download_url}")

!wget $download_url

model_file = model + ".bin"

download_url: https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
--2024-11-10 14:40:11--  https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
Resolving huggingface.co (huggingface.co)... 2600:9000:26b8:1800:17:b174:6d00:93a1, 2600:9000:26b8:da00:17:b174:6d00:93a1, 2600:9000:26b8:ec00:17:b174:6d00:93a1, ...
Connecting to huggingface.co (huggingface.co)|2600:9000:26b8:1800:17:b174:6d00:93a1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.hf.co/repos/88/4b/884bade32e5ee32eea725c5087af1358179a1bea94a4f6abc3c0470c9610ac38/cd590644d963867a2b6e5a1107f51fad663c41d79c149fbecbbb1f95fa81f49a?response-content-disposition=...

In [8]:
#@title Generate Stories

# Generate args
max_token = 256 #@param {type:"slider", min:32, max:1024, step:32}
temperature = 0.8 #@param {type:"slider", min:0.0, max:1, step:0.05}
top_p = 0.9 #@param {type:"slider", min:0.0, max:1.0, step:0.05}
prompt = "One day, Lily met a Shoggoth" #@param {type:"string"}

print(f"model: {model_file}, max_token: {max_token}, temperature: {temperature}, top_p: {top_p}, prompt: {prompt}")
print(f"----------------------------\n")

cmd = f'./run {model_file} -t {temperature} -p {top_p} -n {max_token} -i "{prompt}"'
!{cmd}

model: stories15M.bin, max_token: 256, temperature: 0.8, top_p: 0.9, prompt: One day, Lily met a Shoggoth
----------------------------

One day, Lily met a Shoggoth. He was a bright green and shiny thing. He smiled and said, "Hi, I'm Shog. I'm a reliable friend. I always help you. Do you want to play with me?" Lily nodded and smiled back. She said, "Yes, I want to play with you. You are a reliable friend."
They played in the park and laughed. They had fun. They were happy. Lily's mom watched them and said, "Wow, you two are very good at playing. You are very good friends. You are both very reliable. You always help each other and share your toys. That is very kind."
Lily and Shog looked at each other. They said, "You are right. Shog is a reliable friend. And you are very reliable. Thank you for sharing." They hugged and said goodbye. They went back to their mom. They told her what they learned. She was proud of them. She said, "That's wonderful. You are very smart and friendly. You are very good friends." She hugged them and kissed them. They hugged her back and said
achieved tok/s: 359.154930
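
The generate cell above builds its command by pasting the prompt between double quotes, which breaks as soon as the prompt itself contains a quote. A minimal sketch of a more robust variant follows. The helper names `build_run_command` and `parse_tok_per_sec` are hypothetical; the flags `-t`, `-p`, `-n` and `-i` and the `achieved tok/s:` line are taken from the cell and its output above.

```python
import re
import shlex


def build_run_command(model_file, temperature, top_p, max_tokens, prompt):
    """Return the ./run invocation as an argument list (hypothetical helper)."""
    return [
        "./run", model_file,
        "-t", str(temperature),   # sampling temperature
        "-p", str(top_p),         # top-p value
        "-n", str(max_tokens),    # number of tokens to generate
        "-i", prompt,             # prompt, passed as a single argument
    ]


def parse_tok_per_sec(output):
    """Extract the number from a line like 'achieved tok/s: 359.154930'."""
    match = re.search(r"achieved tok/s:\s*([0-9.]+)", output)
    return float(match.group(1)) if match else None


args = build_run_command("stories15M.bin", 0.8, 0.9, 256,
                         'Lily said "hi" to a Shoggoth')
# shlex.join quotes each argument so the string is safe to hand to a shell.
print(shlex.join(args))
```

In a notebook the quoted string can be run with `!{shlex.join(args)}`; outside one, the list can be passed directly to `subprocess.run(args, capture_output=True, text=True)` and the captured text fed to `parse_tok_per_sec`.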


```python
#@title Run Meta's Llama 2 models

#@markdown Enter your Hugging Face [access token](https://huggingface.co/settings/tokens) to download Meta's Llama 2 models.

from huggingface_hub import snapshot_download

token = "replace with your Hugging Face access token" #@param {type:"string"}
# `token` replaces the deprecated `use_auth_token` argument in recent huggingface_hub releases
path = snapshot_download(repo_id="meta-llama/Llama-2-7b", cache_dir="Llama-2-7b", token=token)

# Convert the downloaded Meta checkpoint to the llama2.c binary format
!python export.py llama2_7b.bin --meta-llama $path

print("./run llama2_7b.bin\n")
!./run llama2_7b.bin
```