## How to Run Hugging Face Models

What is Hugging Face?
----------------------
Hugging face is like GitHub for ML models: tons of open models (NLP, vision, audio, diffusion). You can download weights and run them locally or host them yourself. Each model repo has: weights (.bin, .safetensors, .gguf), config (config.json), tokenizer files, and training metadata. Hugging Face also builds libraries (transformers, datasets, diffusers) to run those models easily in Python.
- Pros: Variety, full control, fine-tuning, offline use, cost-efficient at scale.
- Cons: You must manage GPUs/infra and optimize for speed.

HF Models VS. API Endpoints
---------------------------
Let's compare calling hugging face models instead of via API Endpoints(e.g. OpenAI, Anthropic, Gemini, etc.). Here you never see weights—just call an API and you pay $$ per token/requests.
- Pros: Easy, reliable, no infra headaches.
- Cons: Closed, $$$, limited models, can’t customize or fine-tune, no offline use.

When to choose?
---------------
- Hugging Face → experimentation, control, cheap at scale, fine-tuning.
- API → quick use, guaranteed uptime, best single model, less hassle

In [24]:
from dotenv import load_dotenv
import os

load_dotenv()

HUGGING_FACE_API_KEY = os.getenv("HUGGING_FACE_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

### Self-Hosting Hugging Face (for maximum control and customization):
Local Deployment: Run models locally using the Transformers library for testing and development.

In [2]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "facebook/bart-large-cnn"
tok = AutoTokenizer.from_pretrained(model_id)
mdl = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "Hugging Face hosts thousands of models... (your article here)"

inputs = tok(text, return_tensors="pt", max_length=1024, truncation=True)
out = mdl.generate(**inputs, max_new_tokens=100)

print(tok.decode(out[0], skip_special_tokens=True))


  from .autonotebook import tqdm as notebook_tqdm
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Hugging Face hosts thousands of models... (your article here) (your articles here) Click here for more information on how to get involved with Hugging Face. Click here to find out more about the site and to sign up for a free trial of the site. Visit www.HuggingFace.com.


In [3]:
from openai import OpenAI
client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role":"user","content":"Summarize this: Hugging Face hosts thousands of models..."}],
    max_tokens=100
)

print(resp.choices[0].message.content)


Hugging Face provides a platform with thousands of models for natural language processing (NLP) tasks, fostering collaboration and accessibility for developers and researchers.


### 1) Hugging Face Inference API (managed, like OpenAI)

In [17]:
from huggingface_hub import InferenceClient
client = InferenceClient(api_key=HUGGING_FACE_API_KEY)
resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role":"user","content":"How many 'G's in 'huggingface'?"}],
    max_tokens=16,
)
print(resp.choices[0].message.content)

 The word 'huggingface' contains three occurences of the letter


In [15]:
from huggingface_hub import InferenceClient

client = InferenceClient(api_key=HUGGING_FACE_API_KEY)

messages = [
    {"role": "user", "content": "How many 'G's in 'huggingface'?"}
]

completion = client.chat.completions.create(model="deepseek-ai/DeepSeek-V3-0324", messages=messages)
print(completion.choices[0].message.content)

Alright, let's tackle the problem: **How many 'G's are in the word 'huggingface'?**

### Understanding the Problem
First, I need to understand what the question is asking. We're given the word "huggingface," and we need to count how many times the letter 'G' (both uppercase and lowercase, though here it's all lowercase) appears in it.

### Breaking Down the Word
Let's write out the word and look at each letter one by one.

The word is: h u g g i n g f a c e

Let's index each letter for clarity:
1. h
2. u
3. g
4. g
5. i
6. n
7. g
8. f
9. a
10. c
11. e

Now, let's go through each position and see if the letter is 'g':

1. h - not g
2. u - not g
3. g - yes (1st g)
4. g - yes (2nd g)
5. i - not g
6. n - not g
7. g - yes (3rd g)
8. f - not g
9. a - not g
10. c - not g
11. e - not g

### Counting the 'G's
From the above, we can see that 'g' appears at positions:
- 3rd letter
- 4th letter
- 7th letter

That's a total of 3 times.

### Verifying
Just to be sure, let's read the word again: "hugg

In [26]:
from openai import OpenAI
client = OpenAI(api_key="OPENAI_API_KEY")
r = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role":"user","content":"How many 'G's in 'huggingface'?"}],
    max_tokens=16,
)
print(r.choices[0].message.content)

AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: OPENAI_A**_KEY. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

### 2) Hugging Face Transformers (local weights + tokenizer)

Downloads weights & tokenizer and runs on your machine. Using Mistral-7B-Instruct:

In [27]:
!pip install llama-cpp-python

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.3.16.tar.gz (50.7 MB)
     ---------------------------------------- 0.0/50.7 MB ? eta -:--:--
     - -------------------------------------- 1.3/50.7 MB 6.1 MB/s eta 0:00:09
     -- ------------------------------------- 2.6/50.7 MB 6.6 MB/s eta 0:00:08
     --- ------------------------------------ 3.9/50.7 MB 6.5 MB/s eta 0:00:08
     ---- ----------------------------------- 5.5/50.7 MB 6.6 MB/s eta 0:00:07
     ----- ---------------------------------- 6.6/50.7 MB 6.4 MB/s eta 0:00:07
     ------ --------------------------------- 7.9/50.7 MB 6.2 MB/s eta 0:00:07
     ------- -------------------------------- 8.9/50.7 MB 6.1 MB/s eta 0:00:07
     -------- ------------------------------- 10.2/50.7 MB 6.0 MB/s eta 0:00:07
     -------- ------------------------------- 11.3/50.7 MB 6.0 MB/s eta 0:00:07
     --------- ------------------------------ 12.3/50.7 MB 5.9 MB/s eta 0:00:07
     ---------- ----------------------------- 13.6/5

In [34]:
from llama_cpp import Llama

# 1. Load the quantized model (GGUF format)
llm = Llama(
    model_path="Llama-3.2-1B-Instruct-Q4_K_M.gguf",  # point to your downloaded model file
    n_ctx=2048                                            # max context length (# of tokens it can "remember")
)

# 2. Define your prompt
prompt = "How many 'G's in 'huggingface'?"

# 3. Run inference
out = llm(prompt, max_tokens=250)

# 4. Print the model's answer
print(out["choices"][0]["text"].strip())

llama_model_loader: loaded meta data with 35 key-value pairs and 147 tensors from Llama-3.2-1B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 1B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 1B
llama_model_loader: - kv   6:                            general.license str              = llama3.2
llama_model_loader: - kv   7:                     

Are there any 'G's in 'growing'? 

## Step 1: Count the number of 'G's in 'huggingface'
There is 1 'G' in 'huggingface'.

## Step 2: Count the number of 'G's in 'growing'
There are 2 'G's in 'growing'.

## Step 3: Calculate the total number of 'G's
Total number of 'G's = 1 (from 'huggingface') + 2 (from 'growing') = 3.

The final answer is: $\boxed{3}$


In [1]:
# CPU-only quick start
!pip install "transformers>=4.42" torch --upgrade

Collecting transformers>=4.42
  Downloading transformers-4.55.2-py3-none-any.whl.metadata (41 kB)
Collecting torch
  Downloading torch-2.8.0-cp311-cp311-win_amd64.whl.metadata (30 kB)
Collecting filelock (from transformers>=4.42)
  Downloading filelock-3.19.1-py3-none-any.whl.metadata (2.1 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers>=4.42)
  Using cached huggingface_hub-0.34.4-py3-none-any.whl.metadata (14 kB)
Collecting regex!=2019.12.17 (from transformers>=4.42)
  Downloading regex-2025.7.34-cp311-cp311-win_amd64.whl.metadata (41 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers>=4.42)
  Using cached tokenizers-0.21.4-cp39-abi3-win_amd64.whl.metadata (6.9 kB)
Collecting safetensors>=0.4.3 (from transformers>=4.42)
  Downloading safetensors-0.6.2-cp38-abi3-win_amd64.whl.metadata (4.1 kB)
Collecting fsspec>=2023.5.0 (from huggingface-hub<1.0,>=0.34.0->transformers>=4.42)
  Using cached fsspec-2025.7.0-py3-none-any.whl.metadata (12 kB)
Collecting sympy>=1.13.3 

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()  # CPU

msgs = [{"role":"user","content":"How many 'G's in 'huggingface'?"}]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200, temperature=0.2)

print(tok.decode(out[0], skip_special_tokens=True))

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


<|user|>
How many 'G's in 'huggingface'? 
<|assistant|>
The 'G' in 'huggingface' is a capital 'G' that is not a part of the actual word.
