# Running Inference Locally

Large Language Models (LLMs) have revolutionized AI applications, but they don't always need to be accessed through cloud APIs. 

In this notebook, we download a small model from the Hugging Face Hub, save it to disk, and run it locally.

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import re

### Running LLMs locally offers several advantages:

- **Privacy** : Your data doesn't leave your environment
- **Cost** : No per-token API charges
- **Latency** : No network round trips (generation speed then depends on your own hardware)
- **Customization** : Full control over model parameters

### However, local LLMs also have limitations:

- **Hardware requirements** : Models need sufficient RAM, and larger models typically need a GPU
- **Model size** : Smaller models fit locally but may have reduced capabilities
- **Updates** : You manage model versions yourself
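
To make the hardware requirement concrete, here is a rough sketch of how much memory the weights alone occupy at different numeric precisions. It assumes 4, 2, and 1 bytes per parameter for float32, float16, and int8, and it ignores activations and framework overhead, so real usage will be higher. The helper name `estimate_weight_memory_mib` and the 7-billion-parameter comparison model are illustrative assumptions, not part of the notebook.

```python
# Rough memory needed just to hold model weights, by numeric precision.
# Assumption: weights dominate; activations and framework overhead are ignored.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def estimate_weight_memory_mib(num_parameters: int, dtype: str = "float32") -> float:
    """Return the approximate size of the weights in MiB."""
    return num_parameters * BYTES_PER_PARAM[dtype] / (1024 ** 2)


# 81,912,576 is the distilgpt2 parameter count printed later in this notebook;
# 7 billion is a hypothetical larger model used only for comparison.
for label, count in [("distilgpt2", 81_912_576), ("hypothetical 7B", 7_000_000_000)]:
    for dtype in BYTES_PER_PARAM:
        size = estimate_weight_memory_mib(count, dtype)
        print(f"{label:>16} @ {dtype:<8}: {size:>10.1f} MiB")
```

Halving the bytes per parameter halves the footprint, which is why reduced-precision weights are a common way to fit larger models on modest hardware.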

In [2]:
# Set the directory to save the model
save_directory = "./downloaded_llms/distilgpt2_model"

# Create the directory if it doesn't exist
os.makedirs(save_directory, exist_ok=True)

# Load model and tokenizer
print("Downloading model from Hugging Face Hub...")
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Print model information
print()
print(f"Model name : {model_name}")
print(f"Number of parameters : {model.num_parameters()}")
# Rough estimate: assumes 4 bytes per parameter (float32 weights)
print(f"Model size on disk : {model.num_parameters() * 4 / (1024 * 1024):.2f} MB (estimated)")
print()

# Save the model and tokenizer to the specified directory
print(f"Saving model to : {save_directory}")
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
print("model and tokenizer saved successfully!")

Downloading model from Hugging Face Hub...

Model name : distilgpt2
Number of parameters : 81912576
Model size on disk : 312.47 MB (estimated)

Saving model to : ./downloaded_llms/distilgpt2_model
model and tokenizer saved successfully!
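
The figure above is an estimate derived from the parameter count. As a cross-check, the following sketch sums the sizes of the files actually written to the save directory. The helper name `directory_size_mib` is an assumption introduced here. The measured value may differ from the estimate because the directory also holds tokenizer and configuration files.

```python
import os


def directory_size_mib(path: str) -> float:
    """Sum the sizes of all regular files under `path`, in MiB."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted.
            if not os.path.islink(file_path):
                total_bytes += os.path.getsize(file_path)
    return total_bytes / (1024 * 1024)


# Assumes the save cell above has already run. For a missing directory
# os.walk yields nothing, so the result is simply 0.00.
size = directory_size_mib("./downloaded_llms/distilgpt2_model")
print(f"Actual size on disk : {size:.2f} MB")
```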


## Loading and Using a Local Model

Once the model is saved, it can be loaded from local storage instead of downloading it again. 

This is especially useful for larger models or when working in environments with limited internet access.
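
Before loading offline, it can help to confirm that the directory really contains a saved model. The sketch below is a heuristic only: it assumes `save_pretrained()` wrote a `config.json` plus at least one weights file ending in `.safetensors` or `.bin`, and file names can vary between library versions. The helper name `looks_like_saved_model` is hypothetical.

```python
import os


def looks_like_saved_model(path: str) -> bool:
    """Heuristic check that `path` holds a saved model.

    Assumption: the directory contains config.json and at least one
    weights file ending in .safetensors or .bin.
    """
    if not os.path.isdir(path):
        return False
    names = os.listdir(path)
    has_config = "config.json" in names
    has_weights = any(name.endswith((".safetensors", ".bin")) for name in names)
    return has_config and has_weights


save_directory = "./downloaded_llms/distilgpt2_model"
if looks_like_saved_model(save_directory):
    print("Local copy found; it can be loaded without network access.")
    # For example:
    # AutoModelForCausalLM.from_pretrained(save_directory, local_files_only=True)
else:
    print("No local copy found; download and save the model first.")
```

Passing `local_files_only=True` to `from_pretrained` asks the library not to contact the Hub at all, so a missing file fails immediately instead of triggering a download attempt.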

In [3]:
# Load model from local directory
print("Loading model from local directory...")
local_model = AutoModelForCausalLM.from_pretrained(save_directory)
local_tokenizer = AutoTokenizer.from_pretrained(save_directory)
# GPT-2 style tokenizers define no padding token, so reuse the end-of-sequence token
local_tokenizer.pad_token = local_tokenizer.eos_token
print("Model loaded from local directory !")

Loading model from local directory...
Model loaded from local directory !


## Generating Text with a Local LLM

The text generation function below exposes several parameters that control the output :

- **Temperature** : Controls randomness (higher = more creative, lower = more deterministic)
- **Max length** : The maximum total number of tokens in the output, including the prompt
- **Top-p (nucleus sampling)** : Limits token selection to the smallest set of most likely tokens whose cumulative probability reaches p
- **Top-k** : Limits selection to the k most likely tokens
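
The effect of these parameters on a next-token distribution can be sketched in plain Python. This is a toy illustration, not the library's implementation: the four-token vocabulary, the scores, and all helper names are made up for the example. Top-k is applied before top-p here, which is an assumption about ordering.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature must be > 0."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def keep_and_renormalize(probs, keep):
    """Zero every token not in `keep`, then rescale so the rest sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def ranked(probs):
    """Token indices ordered from most to least likely."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


def apply_top_k(probs, k):
    """Keep only the k most likely tokens."""
    return keep_and_renormalize(probs, set(ranked(probs)[:k]))


def apply_top_p(probs, top_p):
    """Keep the smallest set of top tokens whose total probability reaches top_p."""
    keep, cumulative = set(), 0.0
    for i in ranked(probs):
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return keep_and_renormalize(probs, keep)


# Hypothetical four-token vocabulary and scores, for illustration only.
vocab = ["is", "will", "covers", "zebra"]
logits = [2.0, 1.5, 0.5, -2.0]

# Lower temperatures concentrate probability on the top-scoring token.
for temperature in (0.2, 0.8, 1.5):
    probs = softmax(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Filter the temperature-scaled distribution, then sample one token.
probs = apply_top_p(apply_top_k(softmax(logits, 0.8), k=3), top_p=0.9)
rng = random.Random(0)  # fixed seed so the toy example is repeatable
print(rng.choices(vocab, weights=probs, k=1)[0])
```

Greedy decoding corresponds to skipping the sampling step entirely and always taking the highest-probability token.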

In [4]:
def generate_text(prompt, max_length=50, temperature=0.8, top_p=0.9, top_k=50, do_sample=True):
    """
    Generate text from a prompt with customizable parameters.

    Args:
        prompt (str): The input text to continue
        max_length (int): Maximum total length in tokens (prompt plus generated text)
        temperature (float): Higher values (>1.0) increase randomness, lower values (<1.0) make it more deterministic
        top_p (float): Nucleus sampling threshold (0-1.0)
        top_k (int): Limits selection to the k most likely tokens
        do_sample (bool): If False, uses greedy decoding instead of sampling

    Returns:
        str: The generated text including the prompt
    """
    
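    # The model was loaded on the CPU by default, so the inputs stay on the CPU too
    device = "cpu"
    # Prepare the inputs: convert the prompt string to token IDs
    inputs = local_tokenizer(prompt, return_tensors="pt", return_attention_mask=True).to(device)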
    
    # Generate text
    output = local_model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=max_length,
        do_sample=do_sample,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        pad_token_id=local_tokenizer.pad_token_id
    )
    
    # Decode the output
    generated_text = local_tokenizer.decode(output[0], skip_special_tokens=True)
    
    # Collapse runs of whitespace (including newlines) into single spaces
    clean_text = re.sub(r"\s+", " ", generated_text)
    
    return clean_text

### Experimenting with Different Generation Parameters

In [7]:
prompt = "Welcome to Fundamentals of LLM Engineering course. This class"

print("Example 1: Default parameters (temperature=0.8)")
print(f"Prompt: \"{prompt}\"")
print(f"Generated: {generate_text(prompt, max_length=75)}")
print("-"*100)
print()

print("Example 2: Low temperature (more deterministic)")
print(f"Prompt: \"{prompt}\"")
print(f"Generated: {generate_text(prompt, temperature=0.2, max_length=75)}")
print("-"*100)
print()

print("Example 3: High temperature (more creative/random)")
print(f"Prompt: \"{prompt}\"")
print(f"Generated: {generate_text(prompt, temperature=1.5, max_length=75)}")
print("-"*100)
print()

# Greedy decoding ignores the sampling parameters, so temperature and top_p are set to None
print("Example 4: Greedy decoding (no sampling, always selects most likely token)")
print(f"Prompt: \"{prompt}\"")
print(f"Generated: {generate_text(prompt, top_p=None, temperature=None, do_sample=False, max_length=75)}")

Example 1: Default parameters (temperature=0.8)
Prompt: "Welcome to Fundamentals of LLM Engineering course. This class"
Generated: Welcome to Fundamentals of LLM Engineering course. This class will be part of LLM Engineering. This course will be part of LLM Engineering. This course will be part of LLM Engineering.
----------------------------------------------------------------------------------------------------

Example 2: Low temperature (more deterministic)
Prompt: "Welcome to Fundamentals of LLM Engineering course. This class"
Generated: Welcome to Fundamentals of LLM Engineering course. This class is a great opportunity to learn how to build a solid, scalable, scalable, and scalable LLM engineering system. 
----------------------------------------------------------------------------------------------------

Example 3: High temperature (more creative/random)
Prompt: "Welcome to Fundamentals of LLM Engineering course. This class"
Generated: Welcome to Fundamentals of LLM Engineerin