LLMs can be either closed source or open source. By "closed" we mean proprietary, which usually means that the model's parameters (weights) are not shared publicly.


Hugging Face hosts a large hub of open models and maintains the `transformers` library used below; various other frameworks for working with LLMs also exist.

That said, there are clear benefits to using proprietary LLMs (such as ChatGPT or Claude). One is that you don't need your own computing power: the model runs on the provider's servers.

### **Phi-3-mini**
One open-source model I suggest is Phi-3-mini, a relatively small (3.8 billion parameters) but quite capable model. Due to its small size, it can run on devices with less than 8 GB of VRAM. If you apply quantization, you can use even less than 6 GB of VRAM. Additionally, the model is released under the MIT license, which allows commercial use with very few restrictions.

(VRAM is different from RAM: RAM is shared among all processes (such as apps) on the computer, whereas VRAM is dedicated to the GPU. On recent MacBooks with Apple silicon (M-series) chips, however, memory is "unified", that is, shared between the CPU and the GPU.)
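
To see roughly where the memory figures above come from, here is a back-of-the-envelope sketch. It counts only the memory for the weights and assumes a fixed number of bytes per parameter at each precision; activations, the KV cache, and framework overhead are ignored, so real usage will be higher. The precision names and the helper `weight_memory_gb` are illustrative choices, not part of any library.

```python
# Rough, weights-only memory estimate for a 3.8-billion-parameter model.
PARAMETERS = 3.8e9  # parameter count quoted above for Phi-3-mini

# Assumed storage cost of one parameter at each precision, in bytes.
BYTES_PER_PARAMETER = {
    "float32": 4.0,
    "float16": 2.0,
    "int8": 1.0,   # 8-bit quantization
    "int4": 0.5,   # 4-bit quantization
}


def weight_memory_gb(parameters: float, bytes_per_parameter: float) -> float:
    """Return the memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return parameters * bytes_per_parameter / 1e9


estimates = {
    precision: weight_memory_gb(PARAMETERS, size)
    for precision, size in BYTES_PER_PARAMETER.items()
}

for precision, gigabytes in estimates.items():
    print(f"{precision:>8}: {gigabytes:.1f} GB")
```

In 16-bit precision the weights alone take about 7.6 GB, which is why quantizing to 8 or 4 bits makes such a difference on small GPUs.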


Another advantage of closed-source models is that they usually come with a polished interface, such as ChatGPT's. Sometimes all we want is a ChatGPT-like interface for a local LLM. Fortunately, there are many frameworks that allow for this. A few examples include **text-generation-webui, KoboldCpp, and LM Studio.**






When you use an LLM, two models are loaded:
- The generative model itself
- Its underlying tokenizer (recall from [link] that how to tokenize is not obvious). The tokenizer is in charge of splitting the input text into tokens before feeding it to the generative model.
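
As a minimal illustration of what a tokenizer does, the sketch below uses a toy word-level tokenizer. This is an assumption made for clarity: real LLM tokenizers, including the one used by Phi-3, split text into subword pieces with a learned vocabulary. The class `ToyTokenizer` is hypothetical and only shows the round trip from text to token IDs and back.

```python
# Toy word-level tokenizer. Real tokenizers work on subword pieces,
# but the text -> tokens -> IDs -> text round trip has the same shape.
class ToyTokenizer:
    def __init__(self, corpus: str) -> None:
        # Build a vocabulary from the unique lowercase words in the corpus.
        words = sorted(set(corpus.lower().split()))
        self.token_to_id = {word: index for index, word in enumerate(words)}
        self.id_to_token = {index: word for word, index in self.token_to_id.items()}

    def tokenize(self, text: str) -> list[str]:
        """Split text into lowercase, whitespace-separated tokens."""
        return text.lower().split()

    def encode(self, text: str) -> list[int]:
        """Map each token to its integer ID (raises KeyError for unknown words)."""
        return [self.token_to_id[token] for token in self.tokenize(text)]

    def decode(self, ids: list[int]) -> str:
        """Map IDs back to tokens and join them into a string."""
        return " ".join(self.id_to_token[i] for i in ids)


toy = ToyTokenizer("the chicken crossed the road")
ids = toy.encode("the chicken crossed")
print(ids)              # integer IDs, which are what the generative model consumes
print(toy.decode(ids))  # back to readable text
```

The generative model never sees raw text, only the integer IDs, which is why the model and its tokenizer must always be loaded as a matching pair.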

Running the following code will start downloading the model, which, depending on your internet connection, can take a couple of minutes.



In [1]:
# If you have an NVIDIA GPU, use this code:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",        # place the model on the NVIDIA GPU
    torch_dtype="auto",       # use the precision stored in the checkpoint
    trust_remote_code=True,   # allow the model's custom code from the Hub to run
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

In [None]:
# If you have a MacBook with Apple silicon (M-series), run this code:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Detect if MPS (Apple Silicon GPU) is available
if torch.backends.mps.is_available():
    device = "mps"   # Use Apple GPU
else:
    device = "cpu"   # Fallback

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.float16,   # works better on MPS
    trust_remote_code=True,
)
model = model.to(device)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.



Although we now have enough to start generating text, `transformers` offers a convenient shortcut that simplifies the process: `transformers.pipeline`.
It encapsulates the model, tokenizer, and text generation process into a single function:

In [None]:
from transformers import pipeline

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False
)


The following parameters are worth mentioning:

- **`return_full_text`**  
  Setting this to `False` returns only the model's output, without the prompt.

- **`max_new_tokens`**  
  The maximum number of tokens the model will generate. By setting a limit, we prevent long and unwieldy output as some models might continue generating output until they reach their context window.

- **`do_sample`**  
  Whether the model uses a sampling strategy to choose the next token. By setting this to `False`, the model always selects the most probable next token (greedy decoding).  
  Later, we will explore several sampling parameters that add some creativity to the model’s output.
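
The difference between `do_sample=False` and sampling can be sketched with a toy example. The token scores below are made up for illustration, and `temperature` is one common sampling parameter assumed here rather than taken from the text above; a real model produces one score per entry in its vocabulary.

```python
import math
import random

# Hypothetical next-token scores (logits), invented for this illustration.
logits = {"egg": 2.0, "road": 1.5, "coop": 0.5, "feather": 0.1}


def softmax(scores: dict[str, float], temperature: float = 1.0) -> dict[str, float]:
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = {token: score / temperature for token, score in scores.items()}
    largest = max(scaled.values())  # subtract the max for numerical stability
    exps = {token: math.exp(score - largest) for token, score in scaled.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


def greedy_pick(scores: dict[str, float]) -> str:
    """Like do_sample=False: always take the highest-scoring token."""
    return max(scores, key=scores.get)


def sample_pick(scores: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Like do_sample=True: draw a token in proportion to its probability."""
    probabilities = softmax(scores, temperature)
    tokens = list(probabilities)
    weights = [probabilities[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
greedy_token = greedy_pick(logits)
sampled_tokens = [sample_pick(logits, 1.0, rng) for _ in range(5)]

print("greedy: ", greedy_token)
print("sampled:", sampled_tokens)
```

Greedy selection returns the same token every time, while sampling can return different tokens across draws, which is where the variety in generated text comes from.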

---

To generate our first text, let’s instruct the model to tell a joke about chickens.  
To do so, we format the prompt as a list of dictionaries, where each dictionary corresponds to one message in the conversation.  
Our role is that of `"user"` and we use the `"content"` key to define our prompt:


In [None]:
# The prompt (user input / query)
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# Generate output
output = generator(messages)
print(output[0]["generated_text"])
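
Behind the scenes, the list of messages is converted into a single prompt string using the model's chat template. The sketch below imitates that step by hand. The special markers (`<|user|>`, `<|end|>`, `<|assistant|>`) are modeled on the format documented for Phi-3 and should be treated as an approximation; the helper `format_chat` is hypothetical. The authoritative template ships with the tokenizer and can be applied with `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`.

```python
def format_chat(messages: list[dict[str, str]]) -> str:
    """Approximate a Phi-3-style chat template (illustrative, not the official one)."""
    parts = [
        f"<|{message['role']}|>\n{message['content']}<|end|>\n"
        for message in messages
    ]
    # Finish with the assistant marker so the model knows it is its turn to answer.
    parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

prompt = format_chat(messages)
print(prompt)
```

Formatting the prompt this way matters because instruction-tuned models are trained on text with these markers and tend to respond poorly to prompts that omit them.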