# Running an LLM on your own laptop
In this notebook, we're going to learn how to run a Hugging Face LLM on our own machine.
https://www.youtube.com/watch?v=Ay5K4tog5NQ


## Download the LLM
We're going to write some code to manually download the model.

In [1]:
import os
from huggingface_hub import hf_hub_download 


In [2]:
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
HUGGING_FACE_API_KEY = os.getenv("HUGGING_FACE_API_KEY")

In [4]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
filenames = [
    ".gitattributes", "LICENSE", "README.md", "USE_POLICY.md",
    "config.json", "generation_config.json",
    "model-00001-of-00004.safetensors", "model-00002-of-00004.safetensors",
    "model-00003-of-00004.safetensors", "model-00004-of-00004.safetensors",
    "model.safetensors.index.json", "special_tokens_map.json",
    "tokenizer.json", "tokenizer_config.json",
]

A local machine can typically handle models of roughly 3B to 7B parameters.
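To see where a range like 3B to 7B comes from, it can help to estimate how much memory the weights alone need. The sketch below is a back-of-the-envelope calculation, not a measurement: the helper `weight_memory_gb` and the `BYTES_PER_PARAM` table are illustrative names, and the real footprint is higher once activations and framework overhead are included.

```python
# Rough memory needed just to hold the weights: parameters x bytes per parameter.
# Activations and framework overhead come on top, so treat these as lower bounds.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str = "float32") -> float:
    """Return the approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter counts are rounded to match the model names used in this notebook.
for name, n_params in [("3B model", 3e9), ("8B model", 8e9)]:
    for dtype in BYTES_PER_PARAM:
        print(f"{name} in {dtype}: ~{weight_memory_gb(n_params, dtype):.0f} GB")
```

Comparing these numbers with the RAM on your machine gives a quick sanity check before starting a multi-gigabyte download.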

In [4]:
model_id = "lmsys/fastchat-t5-3b-v1.0"
filenames = [
        "pytorch_model.bin", "added_tokens.json", "config.json", "generation_config.json", 
        "special_tokens_map.json", "spiece.model", "tokenizer_config.json"
]

In [8]:
for filename in filenames:
    downloaded_model_path = hf_hub_download(
        repo_id=model_id,
        filename=filename,
        token=HUGGING_FACE_API_KEY,
    )
    print(downloaded_model_path)

/Users/saivenkatreddykopparthi/.cache/huggingface/hub/models--lmsys--fastchat-t5-3b-v1.0/snapshots/0b1da230a891854102d749b93f7ddf1f18a81024/pytorch_model.bin
/Users/saivenkatreddykopparthi/.cache/huggingface/hub/models--lmsys--fastchat-t5-3b-v1.0/snapshots/0b1da230a891854102d749b93f7ddf1f18a81024/added_tokens.json
/Users/saivenkatreddykopparthi/.cache/huggingface/hub/models--lmsys--fastchat-t5-3b-v1.0/snapshots/0b1da230a891854102d749b93f7ddf1f18a81024/config.json
/Users/saivenkatreddykopparthi/.cache/huggingface/hub/models--lmsys--fastchat-t5-3b-v1.0/snapshots/0b1da230a891854102d749b93f7ddf1f18a81024/generation_config.json
/Users/saivenkatreddykopparthi/.cache/huggingface/hub/models--lmsys--fastchat-t5-3b-v1.0/snapshots/0b1da230a891854102d749b93f7ddf1f18a81024/special_tokens_map.json
/Users/saivenkatreddykopparthi/.cache/huggingface/hub/models--lmsys--fastchat-t5-3b-v1.0/snapshots/0b1da230a891854102d749b93f7ddf1f18a81024/spiece.model
/Users/saivenkatreddykopparthi/.cache/huggingface/hu

As the paths above show, all the model files are downloaded to the `.cache` folder in your home directory (`~/.cache/huggingface/hub`).
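The folder names in the output above follow a visible pattern: `models--`, then the repo id with `/` replaced by `--`. The sketch below rebuilds that path with the standard library only. The helper names `hub_cache_dir` and `model_cache_dir` are hypothetical, and the sketch assumes the default location plus the `HF_HOME` and `HF_HUB_CACHE` environment variables; other overrides are not handled.

```python
import os
from pathlib import Path


def hub_cache_dir() -> Path:
    """Best guess at the Hub cache directory (assumption: only two overrides)."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface")
    return Path(hf_home) / "hub"


def model_cache_dir(repo_id: str) -> Path:
    """Map a repo id such as 'org/name' to its 'models--org--name' folder."""
    return hub_cache_dir() / ("models--" + repo_id.replace("/", "--"))


cache_dir = model_cache_dir("lmsys/fastchat-t5-3b-v1.0")
print(cache_dir)
print("already downloaded:", cache_dir.exists())
```

This is handy for checking how much disk space a model takes, or for deleting a model you no longer need.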

In [5]:
# For Llama
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

tokenizer = AutoTokenizer.from_pretrained(model_id, legacy=False)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Set pad_token_id to eos_token_id for compatibility
tokenizer.pad_token_id = tokenizer.eos_token_id

pipeline = pipeline("text-generation", model=model, device=-1, tokenizer=tokenizer, max_length=1000)

Loading checkpoint shards: 100%|██████████| 4/4 [01:00<00:00, 15.11s/it]


Both the cell above and the cell below load the model automatically from the files already downloaded to the `.cache` folder.

In [5]:
# for lmsys/fastchat-t5-3b-v1.0
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(model_id, legacy=False, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

pipeline = pipeline("text2text-generation", model=model, device=-1, tokenizer=tokenizer, max_length=1000)

In Hugging Face's `pipeline` function, the `device` parameter specifies where the model should run (e.g., on a CPU or GPU). Here's what `device=-1` means:

### **Explanation of `device=-1`**
- `device=-1`: Tells the pipeline to use the **CPU** for inference.
- `device=0` (or any higher integer): Selects the **GPU** with that index for inference. For example:
  - `device=0`: Use the first GPU.
  - `device=1`: Use the second GPU (if multiple GPUs are available).

### **Why Use `device=-1`?**
- **When a GPU is unavailable**: If your system doesn't have a GPU or CUDA isn't set up, setting `device=-1` ensures the model runs on the CPU.
- **Testing or development**: CPU is often used for small-scale testing or development since it's simpler to set up and doesn't require GPU-specific configurations.
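Rather than hard-coding `device=-1`, the value can be chosen at runtime. The following is a minimal sketch, assuming PyTorch is the backend and only CUDA GPUs matter; the helper name `pick_device` is hypothetical, and other accelerators (such as Apple Silicon) are not covered here.

```python
def pick_device() -> int:
    """Return 0 when a CUDA GPU is visible to PyTorch, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU backend to use, so fall back to CPU.
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print(f"device={device}")
```

The result can then be passed as the `device` argument when building the pipeline, so the same notebook runs on machines with and without a GPU.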

In [None]:
pipeline("")

[{'generated_text': '-'}]

In [20]:
pipeline("""My name is Mark.
I have brothers called David and John and my best friend is Michael.
Using only the context above. Do you know if I have a sister?    
""")

[{'generated_text': '<pad> No,  I  do  not  have  a  sister.\n'}]
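The raw output above contains a leading `<pad>` token and doubled spaces. One way to tidy it for display is a small post-processing step. This is a sketch only: `clean_generated_text` is a hypothetical helper, and it assumes that stripping `<pad>` and collapsing whitespace is all the clean-up you need.

```python
import re


def clean_generated_text(text: str) -> str:
    """Drop '<pad>' markers and collapse runs of whitespace into single spaces."""
    text = text.replace("<pad>", "")
    return re.sub(r"\s+", " ", text).strip()


# The pipeline result copied from the cell above.
result = [{'generated_text': '<pad> No,  I  do  not  have  a  sister.\n'}]
print(clean_generated_text(result[0]["generated_text"]))
# No, I do not have a sister.
```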