# Quantized Models from the Hugging Face Community
The other variations that interest us are distributed in the GGUF format, the model file format used by llama.cpp.

You can browse the GGUF variants of Gemma published by the community [here](https://huggingface.co/models?sort=trending&search=gemma+gguf).

In [None]:
# With NVidia CUDA acceleration
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.46.tar.gz (36.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36.7/36.7 MB 39.6 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 6.4 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.46-cp310-cp310-manylinux_2_35_x86_64.whl size=22741834 sha256=138ce449964e303e9e63381dcf7d79cbc632fa13893876790afdd6aff6684702
  Stored 

In [None]:
from huggingface_hub import hf_hub_download

from llama_cpp import Llama

In [None]:
model_name_or_path = "brittlewis12/gemma-2b-GGUF"
model_basename = "gemma-2b.Q8_0.gguf" # 8-bit (Q8_0) quantized weights in GGUF format

In [None]:
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

gemma-2b.Q8_0.gguf:   0%|          | 0.00/2.67G [00:00<?, ?B/s]
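The 2.67G reported by the progress bar can be sanity-checked with a little arithmetic. The sketch below assumes that Q8_0 stores weights in blocks of 32 signed 8-bit values plus one 16-bit scale per block (8.5 bits per weight), and that Gemma 2B has roughly 2.51 billion parameters. Neither figure is stated in this notebook, and some small tensors are usually kept in higher precision, so treat the result as an approximation only. The function names are illustrative.

```python
def q8_0_bits_per_weight(block_size=32, scale_bits=16):
    """Average bits per weight for a Q8_0-style block (assumed layout)."""
    # Each block holds `block_size` int8 values and one shared scale.
    return (block_size * 8 + scale_bits) / block_size


def estimate_size_gb(n_params, bits_per_weight):
    """Approximate file size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


ASSUMED_PARAMS = 2.51e9  # assumption: approximate parameter count of Gemma 2B

bpw = q8_0_bits_per_weight()
size_gb = estimate_size_gb(ASSUMED_PARAMS, bpw)
print(f"{bpw} bits/weight, about {size_gb:.2f} GB")
```

Under these assumptions the estimate lands close to the size of the downloaded file, which suggests that nearly all of the weights are stored at 8 bits.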

In [None]:
# Set n_gpu_layers to the number of layers to offload to the GPU. Set it to 0 if no GPU acceleration is available on your system.
llm = Llama(
  model_path=model_path,  # Path to the GGUF file downloaded above
  n_ctx=8192,             # The max sequence length to use; this model's metadata reports a context length of 8192, and longer sequences require much more memory
  n_threads=8,            # The number of CPU threads to use; tune to your system and the resulting performance
  n_gpu_layers=35         # The number of layers to offload to the GPU, if GPU acceleration is available
)

llama_model_loader: loaded meta data with 21 key-value pairs and 164 tensors from /root/.cache/huggingface/hub/models--brittlewis12--gemma-2b-GGUF/snapshots/e353e5f9dcff7ae4b11ba3c065f1f6ab4c480423/gemma-2b.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-2b
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                          gemma.block_count u32              = 18
llama_model_loader: - kv   4:                     gemma.embedding_length u32              = 2048
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 16384
llama_model_loader: - kv   6:                 gemma.attention.head_count
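The metadata dump is handy when choosing `n_ctx`, because it reports the context length the model was trained with. As a rough sketch, the lines can be parsed with a regular expression and the requested window capped at that value. The helper names `parse_loader_metadata` and `cap_context` are hypothetical, and the pattern assumes the log layout shown above.

```python
import re

# Sample lines copied from the loader output above.
LOG = """\
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                          gemma.block_count u32              = 18
"""

# Captures: key, type name, value.
KV_LINE = re.compile(r"llama_model_loader: - kv\s+\d+:\s+(\S+)\s+(\S+)\s+=\s+(.*)")


def parse_loader_metadata(log):
    """Turn 'kv' log lines into a dict; lines that do not match are skipped."""
    meta = {}
    for line in log.splitlines():
        match = KV_LINE.match(line.strip())
        if match is None:
            continue
        key, type_name, value = match.groups()
        # Integer types such as u32 or i64 are converted; the rest stay strings.
        if re.fullmatch(r"[ui]\d+", type_name):
            meta[key] = int(value)
        else:
            meta[key] = value.strip()
    return meta


def cap_context(requested, meta):
    """Never ask for a larger window than the model reports."""
    arch = meta["general.architecture"]
    return min(requested, meta[f"{arch}.context_length"])


meta = parse_loader_metadata(LOG)
n_ctx = cap_context(32768, meta)
print(meta["gemma.block_count"], n_ctx)
```

The capped value can then be passed as `n_ctx` when constructing `Llama`.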

## Text completion

In [None]:
prompt = "Simply put, the theory of relativity states that "

# Plain completion: the base model simply continues the prompt.
# echo=True returns the prompt together with the generated text.
response = llm(prompt=prompt, max_tokens=512, temperature=0.7, top_p=0.95,
               repeat_penalty=1.1, top_k=150,
               echo=True)


Llama.generate: prefix-match hit

llama_print_timings:        load time =      91.04 ms
llama_print_timings:      sample time =    3046.05 ms /   512 runs   (    5.95 ms per token,   168.09 tokens per second)
llama_print_timings: prompt eval time =      86.36 ms /    10 tokens (    8.64 ms per token,   115.79 tokens per second)
llama_print_timings:        eval time =    9236.03 ms /   511 runs   (   18.07 ms per token,    55.33 tokens per second)
llama_print_timings:       total time =   24410.64 ms /   521 tokens
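The per-token figures in the timing report follow directly from the elapsed time and the token count. A minimal sketch that reproduces them from the numbers printed above (the `throughput` helper is illustrative, not part of llama-cpp-python):

```python
def throughput(elapsed_ms, n_tokens):
    """Return (milliseconds per token, tokens per second)."""
    ms_per_token = elapsed_ms / n_tokens
    tokens_per_second = n_tokens / (elapsed_ms / 1000.0)
    return ms_per_token, tokens_per_second


# Values taken from the "eval time" and "prompt eval time" lines above.
eval_ms_per_token, eval_tps = throughput(9236.03, 511)
prompt_ms_per_token, prompt_tps = throughput(86.36, 10)

print(f"generation: {eval_ms_per_token:.2f} ms/token, {eval_tps:.2f} tokens/s")
print(f"prompt:     {prompt_ms_per_token:.2f} ms/token, {prompt_tps:.2f} tokens/s")
```

The "eval time" line is the one that matters most for interactive use, since it measures how quickly new tokens are generated once the prompt has been processed.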


In [None]:
print(response["choices"][0]["text"])

Simply put, the theory of relativity states that <strong>the faster an object moves relative to another object, the slower its motion appears to other objects.</strong>

The difference between light-time and time dilation is that light-time only affects the direction of travel; time dilation also changes the rate at which clocks tick (time).  This means that if two observers measure different rates for the same physical event, one must be moving with respect to the other.

The effects of time dilation are most noticeable for observers traveling at relativistic speeds, or close to it.  If two observers are moving in opposite directions with respect to each other (i.e. one is approaching from behind), they will experience time dilations with respect to each other’s clocks but not with respect to their own.  As a result, it would appear that one observer’s clock ticks more slowly than the other’s (as measured by his/her own clock).

Time dilation can be observed in many ways, including:
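Because the call used `echo=True`, the printed text starts with the prompt itself. If only the newly generated part is wanted, the prompt prefix can be removed afterwards. The sketch below uses a hand-written dictionary that mimics the `choices[0]["text"]` layout used above; `completion_only` and `fake_response` are hypothetical names and the sample text is made up for illustration.

```python
def completion_only(response, prompt):
    """Return the generated text without the echoed prompt, if present."""
    text = response["choices"][0]["text"]
    if text.startswith(prompt):
        return text[len(prompt):]
    return text


prompt = "Simply put, the theory of relativity states that "

# Stand-in for a real response object, with made-up content.
fake_response = {"choices": [{"text": prompt + "time is relative."}]}

print(completion_only(fake_response, prompt))
```

Alternatively, calling the model with `echo=False` avoids returning the prompt in the first place.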



## Chat

In [None]:
prompt = "Write a poem that helps me to remember 10 elements of periodic table"
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: {prompt}

ASSISTANT:
'''

response = llm(prompt=prompt_template, max_tokens=256, temperature=0.7, top_p=0.95,
               repeat_penalty=1.1, top_k=150,
               echo=True)


Llama.generate: prefix-match hit

llama_print_timings:        load time =      91.04 ms
llama_print_timings:      sample time =    1320.95 ms /   223 runs   (    5.92 ms per token,   168.82 tokens per second)
llama_print_timings: prompt eval time =      73.34 ms /    41 tokens (    1.79 ms per token,   559.06 tokens per second)
llama_print_timings:        eval time =    4010.12 ms /   222 runs   (   18.06 ms per token,    55.36 tokens per second)
llama_print_timings:       total time =   10605.13 ms /   263 tokens
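The SYSTEM/USER/ASSISTANT layout used in this cell is plain text, so it can be produced by a small helper, and the reply can be trimmed if the model keeps going and invents further turns. This is only a sketch: `build_chat_prompt` and `first_turn` are hypothetical helpers, and the layout is the one used in this notebook rather than an official Gemma chat format.

```python
DEFAULT_SYSTEM = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully."
)


def build_chat_prompt(user_message, system_message=DEFAULT_SYSTEM):
    """Build the same SYSTEM / USER / ASSISTANT text used in the cell above."""
    return f"SYSTEM: {system_message}\n\nUSER: {user_message}\n\nASSISTANT:\n"


def first_turn(completion, stop_markers=("USER:", "SYSTEM:")):
    """Keep only the text before the model starts a new, invented turn."""
    cut = len(completion)
    for marker in stop_markers:
        index = completion.find(marker)
        if index != -1:
            cut = min(cut, index)
    return completion[:cut].strip()


chat_prompt = build_chat_prompt("Name three noble gases.")
print(chat_prompt)
print(first_turn("Helium, neon and argon.\n\nUSER: And three more?"))
```

The completion call in llama-cpp-python also accepts a `stop` argument, for example `stop=["USER:"]`, which ends generation at that marker instead of trimming the text afterwards.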


In [None]:
print(response["choices"][0]["text"])

SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: Write a poem that helps me to remember 10 elements of periodic table

ASSISTANT:
1. The Periodic Table shows the relationship between different elements. It is arranged in order of increasing atomic number.
2. The vertical columns are called families or periods. Each family consists of all elements having similar properties.
3. The horizontal rows are called groups or periods. Each group consists of all elements having similar properties.
4. The element with smallest atomic number is Hydrogen (H) while the element with largest atomic number is Uranium (U).
5. The element with smallest atomic mass is Lithium (Li) while the element with largest atomic mass is Uranium (U).
6. The element with smallest atomic weight is Helium (He) while the element with largest atomic weight is Uranium (U).
7. The element with smallest atomic number is Hydrogen (H) while the element with largest atomic number is 