# 3 - Hugging Face Model Repository

<center>
    <img src="../img/HF-logo-horizontal.png" width=600>
</center>

First, it's important to understand that [Hugging Face https://huggingface.co/](https://huggingface.co/) provides a [vast ecosystem](https://huggingface.co/welcome) for Machine Learning & AI practitioners:

* A [repository](https://huggingface.co/models) of ML/AI models
* A suite of [software libraries](https://huggingface.co/docs) (mostly in Python and JS)
* Over 500k [datasets](https://huggingface.co/datasets) suitable for training or testing
* ML/AI [training content](https://huggingface.co/learn)
* ML/AI [research publications](https://huggingface.co/papers)
* Model hosting UI in [HF Spaces](https://huggingface.co/spaces)
* Personal & Organizational [Hugging Face Hub](https://huggingface.co/docs/hub/en/index) to coordinate your use of AI models (local or cloud)
* Community [forum](https://discuss.huggingface.co/) for discussion and a [blog](https://huggingface.co/blog)

The whole ecosystem is worth exploring, but here we'll focus only on the [model repository https://huggingface.co/models](https://huggingface.co/models).

## 3.1 Find a GGUF model

[*GGUF*](https://huggingface.co/docs/hub/en/gguf) (GPT-Generated Unified Format) model files provide a one-file version of a model that can be used with Ollama, Llama.cpp, and other LLM hosting frameworks. There are other model formats, such as [*Safetensors*](https://huggingface.co/docs/safetensors/en/index), but they require more steps to self-host, so we'll stick with GGUF for now.

Hugging Face has over [150k GGUF format models](https://huggingface.co/models?library=gguf), so you need to be selective in which ones are suitable for your needs.

[https://huggingface.co/models?library=gguf](https://huggingface.co/models?library=gguf)

For our purposes, let's look at the [Phi3 Mini 4B parameter model](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf) from Microsoft.

You can start downloading the Q4 GGUF file (about 2.4GB) right away, so that it downloads in the background while we read through the model card. Save the file to the `modelfiles` folder in the tutorial repository.

[https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/tree/main](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/tree/main)
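If you prefer to script the download instead of clicking through the web interface, the following is a minimal sketch using only the Python standard library. It assumes the Hub's `resolve/<revision>/<filename>` direct-download URL pattern and the file name used later in this tutorial's `Modelfile`. The helper names `gguf_download_url` and `download_gguf` are hypothetical, not part of any library.

```python
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlretrieve

# Repository and file name as used elsewhere in this tutorial
REPO_ID = "microsoft/Phi-3-mini-4k-instruct-gguf"
FILENAME = "Phi-3-mini-4k-instruct-q4.gguf"


def gguf_download_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL for one file in a Hugging Face model repo.

    Assumes the 'resolve/<revision>/<filename>' URL layout.
    """
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{quote(filename)}"


def download_gguf(repo_id: str, filename: str, target_dir: str = "modelfiles") -> Path:
    """Download one file into target_dir and return the local path."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    destination = target / filename
    urlretrieve(gguf_download_url(repo_id, filename), str(destination))
    return destination


# Only the URL is printed here; call download_gguf(REPO_ID, FILENAME)
# yourself when you are ready to fetch the ~2.4GB file.
print(gguf_download_url(REPO_ID, FILENAME))
```

For large files, the `hf_hub_download` function from the `huggingface_hub` library is likely the more robust choice, since it handles caching and authentication for you.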

## 3.2 Create a `Modelfile` for the GGUF

The *Model Card* for Phi3 Mini specifies that the `Modelfile` can be downloaded with this command:

```
hf download microsoft/Phi-3-mini-4k-instruct-gguf Modelfile_q4 --local-dir .
```

However, if you look in the *"Files and versions"* area of the model repository, you'll see that you can also download the file directly from the web interface. A version of it is already available in the `model_import` folder in the tutorial repository.

`phi3-4b.modelfile`:
```
FROM ./Phi-3-mini-4k-instruct-q4.gguf

TEMPLATE """<s>{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"""

PARAMETER stop <|endoftext|>
PARAMETER stop <|assistant|>
PARAMETER stop <|end|>

PARAMETER num_ctx 4096
```
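To make the `TEMPLATE` and `stop` parameters more concrete, here is a rough Python sketch of the text the model would see for a single prompt, and of how a stop token cuts off the generated text. This only illustrates the template above; it is not how Ollama is implemented internally, and the function names `render_phi3_prompt` and `truncate_at_stop` are hypothetical.

```python
# Stop tokens copied from the PARAMETER lines of the Modelfile
STOP_TOKENS = ["<|endoftext|>", "<|assistant|>", "<|end|>"]


def render_phi3_prompt(prompt: str) -> str:
    """Mimic the TEMPLATE up to the point where the model starts generating."""
    parts = ["<s>"]
    if prompt:  # corresponds to {{ if .Prompt }} ... {{ end }}
        parts.append(f"<|user|>\n{prompt}<|end|>\n")
    parts.append("<|assistant|>\n")  # the model's response continues from here
    return "".join(parts)


def truncate_at_stop(text: str, stops=STOP_TOKENS) -> str:
    """Cut generated text at the earliest stop token, if any appears."""
    cut = len(text)
    for stop in stops:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


print(repr(render_phi3_prompt("What is GGUF?")))
print(truncate_at_stop("A one-file model format.<|end|><|user|>ignored"))
```

Without the stop tokens, a model can keep generating past the end of its answer and start inventing the next user turn.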

Now create the Ollama model from the modelfile:
```
ollama create phi3-mini:4b -f phi3-4b.modelfile
```

In [1]:
import ollama

In [2]:
%%time
response = ollama.generate(model='phi3-mini:4b',
                           prompt='What are three things I should do this week?')

CPU times: user 2.79 ms, sys: 6.93 ms, total: 9.72 ms
Wall time: 12.1 s


In [3]:
print(response.response)

 Here are three productive activities you can consider doing within the next week:

1. Learn a new skill or hobby: Dedicate some time to learn something that interests you, whether it's cooking, painting, playing an instrument, coding, or learning a foreign language. Pick one specific area and set aside dedicated time each day for practice. There are numerous resources available online like tutorials, courses on platforms such as Udemy, Coursera, or Skillshare that cater to different skill sets.

2. Organize your personal life: Set goals around improving various aspects of your personal well-being and daily routines. This can include setting a fitness routine, meal planning, reorganizing your living space, decluttering digital files or implementing a better sleep schedule. These changes can significantly improve overall quality of life, productivity, and mental health.

3. Contribute to the community: Look for ways you can give back within your local area. This might include volunteeri
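The `%%time` wall time above includes everything (loading the model, processing the prompt, generating tokens). For a rough generation-speed figure, Ollama's response also reports `eval_count` (tokens generated) and `eval_duration` (in nanoseconds). The sketch below shows the arithmetic; the numbers in `example` are made up for illustration and are not measurements from this notebook.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's token count and nanosecond duration to tokens/second."""
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


# Hypothetical values, for illustration only
example = {"eval_count": 240, "eval_duration": 12_000_000_000}
rate = tokens_per_second(example["eval_count"], example["eval_duration"])
print(f"{rate:.1f} tokens/s")
```

With a real response you would call `tokens_per_second(response.eval_count, response.eval_duration)`.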

## 3.3 Serve GGUF model with Llama.cpp

There are several tools you can use for serving local LLMs.  Ollama is a versatile all-in-one framework with UI, CLI, API, and Python libraries that can support both local LLMs and cloud-hosted LLMs, with no constraints on personal or professional/commercial use.

However, Llama.cpp is a performant and more production-oriented LLM server (and Ollama is built on top of it).

```
llama-server -m ./Phi-3-mini-4k-instruct-q4.gguf --jinja -c 0 --host 127.0.0.1 --port 8080
```

Llama.cpp will provide a web interface to the served model(s), in this case available at:

[http://127.0.0.1:8080](http://127.0.0.1:8080)
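Loading a model can take a while, so it can help to check that the server is ready before sending requests from Python. The sketch below assumes `llama-server` exposes a `/health` endpoint that returns HTTP 200 once the model is loaded. The helper names `health_url` and `server_is_ready` are hypothetical.

```python
from urllib.request import urlopen


def health_url(base_url: str) -> str:
    """Build the health-check URL (assumes a '/health' endpoint)."""
    return base_url.rstrip("/") + "/health"


def server_is_ready(base_url: str = "http://127.0.0.1:8080", timeout: float = 2.0) -> bool:
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urlopen(health_url(base_url), timeout=timeout) as reply:
            return reply.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses
        return False


print(health_url("http://127.0.0.1:8080"))
```

A notebook cell could then poll `server_is_ready()` in a short loop with `time.sleep(1)` before creating the client.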

## 3.4 Use the `openai` Python library to access the local LLM server

In [4]:
from openai import OpenAI

In [5]:
client = OpenAI(base_url="http://127.0.0.1:8080/v1",
                # no API key needed when accessing Llama.cpp LLM server
                api_key="dummy"
)

In [6]:
messages = [
        dict(role="system", content="You are a helpful assistant."),
        dict(role="user",   content="What are the top 3 things to know about LLMs?")
            ]

In [7]:
%%time
response = client.chat.completions.create(
    model="dummy",  # Llama.cpp can only serve one model at a time, so this is ignored
    messages=messages
)

CPU times: user 80 ms, sys: 29 ms, total: 109 ms
Wall time: 22.8 s


The actual response text is buried several layers in:

In [8]:
print(response.choices[0].message.content)

 Large Language Models (LLMs) are a type of Artificial Intelligence (AI) models that have made significant advancements in natural language processing (NLP) and understanding. Here are the top three things you should know about LLMs:

1. Basics and Components:
   a. Definition: Large Language Models are deep learning-based models that use neural network architectures, such as Transformer and its variants (e.g., GPT-3, BERT, T5, etc.), to process and analyze human language. These models are designed to understand, interpret, and generate human language in a way that is coherent, relevant, and contextually appropriate.

   b. Components: LLMs are composed of three main components - an input layer, hidden layers, and an output layer. The input layer processes the input text or sequence of tokens, while hidden layers (stacked multiple layers) capture the hierarchical and contextual relationships between tokens, and the output layer generates the desired output, such as a continuation of th

In [9]:
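messages.append(dict(role="assistant", content=response.choices[0].message.content))  # keep the reply in the history
messages.append(dict(role="user", content="Please expand on the last point"))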

In [10]:
%%time
response = client.chat.completions.create(
    model="dummy",  # Llama.cpp can only serve one model at a time, so this is ignored
    messages=messages
)

CPU times: user 5.51 ms, sys: 9.32 ms, total: 14.8 ms
Wall time: 25.8 s


In [None]:
print(response.choices[0].message.content)
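The chat completions endpoint is stateless: every request has to carry the whole conversation, including earlier assistant replies. One way to keep that bookkeeping in one place is a small helper like the sketch below. `chat_turn` and `make_fake_client` are hypothetical names. The fake client only exists so the example runs without a server; with the real setup you would pass the `client` created above.

```python
from types import SimpleNamespace


def chat_turn(client, messages, user_text, model="dummy"):
    """Send one user message, record the assistant reply, and return it."""
    messages.append(dict(role="user", content=user_text))
    response = client.chat.completions.create(model=model, messages=messages)
    reply = response.choices[0].message.content
    messages.append(dict(role="assistant", content=reply))
    return reply


def make_fake_client(canned_reply):
    """Stand-in with the same attribute shape as the OpenAI client (demo only)."""
    def create(model, messages):
        message = SimpleNamespace(content=canned_reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])

    completions = SimpleNamespace(create=create)
    return SimpleNamespace(chat=SimpleNamespace(completions=completions))


history = [dict(role="system", content="You are a helpful assistant.")]
reply = chat_turn(make_fake_client("Hello!"), history, "Hi there")
print(reply)
print([m["role"] for m in history])
```

Because the history grows with every turn, long conversations will eventually exceed the model's context window (4096 tokens for this Phi3 Mini variant).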

### EXERCISE: Fetch an LLM from HF and serve it with Ollama and Llama.cpp

*(10 minutes)*

Using the Hugging Face version of the TinyLlama 1.1B model found here:

[https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF)

Select the `Q4_K_M` GGUF variant from the files section.  This indicates:

* `Q4` - 4-bit quantization (precision reduction)
* `K` - Llama.cpp's "k-quant" scheme, which quantizes weights in blocks and stores scaling factors for each block
* `M` - Medium precision quantization (trade-off in size & performance/quality)
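To get a feel for why quantization matters, the following back-of-the-envelope sketch estimates file size as parameters × bits per weight. It is a simplification: real GGUF files are somewhat larger, since they include metadata and typically keep some tensors at higher precision. The function name `estimated_size_gb` is hypothetical.

```python
def estimated_size_gb(n_parameters: float, bits_per_weight: float) -> float:
    """Rough model size in GB: parameters x bits per weight, converted to bytes."""
    return n_parameters * bits_per_weight / 8 / 1e9


# TinyLlama has about 1.1 billion parameters
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {estimated_size_gb(1.1e9, bits):.2f} GB")
```

Going from 16-bit to 4-bit weights cuts the estimate to a quarter, which is what makes models like this practical on a laptop.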

1. Create a simple 1-line `Modelfile`
2. Create the Ollama model
3. Run the Ollama model & experiment with a chat session
4. Serve the GGUF model through Llama.cpp and connect to it from the Python `openai` library