In [None]:
%%capture
!pip install "transformers>=4.40.1" "accelerate>=0.27.2"

In [1]:
!pip install -U transformers accelerate


In [2]:
import transformers
print(transformers.__version__)  # Should be 4.40.1 or newer, matching the install cell

4.52.4
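
The version check above can also be done programmatically rather than by eye. The following is a minimal sketch using only the standard library; `version_tuple` and `meets_minimum` are hypothetical helper names, and the minimum of 4.40.1 is taken from the install cell.

```python
import re

MINIMUM = (4, 40, 1)  # assumed minimum, taken from the install cell

def version_tuple(version):
    """Convert a string such as '4.52.4' into (4, 52, 4).

    Only the leading digits of the first three components are kept,
    so a suffix such as 'rc1' or 'dev0' is ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)

def meets_minimum(version, minimum=MINIMUM):
    """Return True if the version is at least the required minimum."""
    return version_tuple(version) >= minimum

print(meets_minimum("4.52.4"))  # True
print(meets_minimum("4.39.3"))  # False
```

In the notebook, `meets_minimum(transformers.__version__)` would replace the manual comparison.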


In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# AutoModelForCausalLM: A class for loading causal language models
# (models that generate text sequentially, like GPT).

# AutoTokenizer: A class for loading the tokenizer associated with the model
# (converts text to tokens and vice versa).


# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", # Model name on Hugging Face Hub
    device_map="cuda", # Load model on GPU (CUDA)
    torch_dtype="auto", # Use the dtype stored in the checkpoint's config
    trust_remote_code=False,  # Do not execute custom code from the model repo
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# The tokenizer is loaded from the same model repository and handles:
# Text → Token conversion (for input).
# Token → Text conversion (for output).

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
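
To make the two tokenizer directions concrete, here is a toy word-level tokenizer. It is an illustration only: `ToyTokenizer` is a hypothetical class, and the real Phi-3 tokenizer works on subword units with a much larger vocabulary.

```python
class ToyTokenizer:
    """Word-level tokenizer for illustration; not the Phi-3 tokenizer."""

    def __init__(self, corpus):
        # Build a vocabulary from the corpus; ID 0 is reserved for unknown words.
        self.token_to_id = {"<unk>": 0}
        for word in sorted(set(corpus.split())):
            self.token_to_id[word] = len(self.token_to_id)
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}

    def encode(self, text):
        """Text -> token IDs (used for model input)."""
        return [self.token_to_id.get(word, 0) for word in text.split()]

    def decode(self, ids):
        """Token IDs -> text (used for model output)."""
        return " ".join(self.id_to_token[i] for i in ids)


tok = ToyTokenizer("create a funny joke about chickens")
ids = tok.encode("a funny joke")
print(ids)              # [1, 5, 6]
print(tok.decode(ids))  # a funny joke
```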

In [8]:
# This code sets up a text-generation pipeline using Hugging Face's transformers library
from transformers import pipeline

# The pipeline() function is a high-level utility from Hugging Face
# that simplifies inference tasks
# (like text generation, translation, summarization, etc.).
# Here, we're creating a text-generation pipeline.

# Create a pipeline
generator = pipeline(
    "text-generation", # Task: Generate text
    model=model, # Reuse the model object loaded above
    tokenizer=tokenizer, # Must be passed explicitly because the model is an object, not a name
    return_full_text=True, # Include input + generated text in output
    max_new_tokens=100, # Cap the response at 100 new tokens (raise it for longer output)
    do_sample=False  # Disable random sampling (deterministic output)
#    temperature=0.7,        # Only used when do_sample=True; lower = more deterministic
)


# return_full_text=True
# If True: Output includes both the input prompt and generated text.
# If False: Only returns the newly generated text.


# do_sample=False
# If False: Uses greedy decoding (always picks the most likely next token → deterministic output).
# If True: Enables random sampling (creative but less predictable output, often used with temperature).


# Deterministic vs. Random Outputs:
# With do_sample=False, the same prompt will always produce the same output.
# For creativity, set do_sample=True and add temperature=0.7 (higher = more random).

Device set to use cuda
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
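
The difference between do_sample=False and do_sample=True described in the cell above can be sketched on a toy example. This is not how the library implements decoding internally; the vocabulary and scores are made up, and `pick_greedy` and `pick_sampled` are hypothetical helper names.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_greedy(logits):
    """do_sample=False: always take the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def pick_sampled(logits, temperature, rng):
    """do_sample=True: draw a token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Made-up scores for three candidate next tokens.
vocab = ["drumsticks", "eggs", "feathers"]
logits = [2.0, 1.5, 0.5]

print(vocab[pick_greedy(logits)])  # always "drumsticks"

rng = random.Random(0)  # seeded so the run is repeatable
print([vocab[pick_sampled(logits, 0.7, rng)] for _ in range(5)])
```

Greedy selection returns the same token on every call, which is why the pipeline output is repeatable with do_sample=False. Sampling can return any of the candidates, with the temperature controlling how strongly the top-scoring token is favored.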


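The pipeline simplifies text generation by handling the underlying steps automatically: it applies the model's chat template to the messages, tokenizes the result, runs generation, and decodes the output tokens back into text.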
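
Those underlying steps can be written out by hand. The function below is a sketch of a manual equivalent, not the pipeline's actual source; `generate_reply` is a hypothetical name, and it assumes a `model` and `tokenizer` loaded as in the earlier cell, with a chat template available on the tokenizer.

```python
def generate_reply(model, tokenizer, messages, max_new_tokens=100):
    """Return only the newly generated text for a list of chat messages."""
    # 1. Apply the chat template and tokenize the conversation.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,  # append the marker that opens the assistant turn
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    # 2. Generate token IDs with greedy decoding (do_sample=False).
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )

    # 3. Drop the prompt tokens and decode only the new ones.
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
```

Calling `generate_reply(model, tokenizer, messages)` with the objects from the notebook should give the assistant's text without the prompt, which corresponds to return_full_text=False in the pipeline.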

In [9]:
# The prompt (user input / query)
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# role: Can be "user", "assistant", or "system" (for instructions).

# Generate output
output = generator(messages)
print(output[0]["generated_text"])

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


[{'role': 'user', 'content': 'Create a funny joke about chickens.'}, {'role': 'assistant', 'content': ' Why did the chicken join the band? Because it had the drumsticks!'}]


In [10]:
messages = [
    {"role": "user", "content": "how to spend the weekend?"}
]

In [11]:
# Generate output
output = generator(messages)
print(output[0]["generated_text"])

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


[{'role': 'user', 'content': 'how to spend the weekend?'}, {'role': 'assistant', 'content': " Spending a weekend can be a great opportunity to relax, recharge, and engage in activities that you enjoy. Here are some suggestions for how to spend your weekend:\n\n1. Plan ahead: Decide on a general theme or type of activity you'd like to do. This could be anything from outdoor adventures, cultural experiences, or simply relaxing at home.\n\n2. Outdoor activities: If the weather permits, consider going for a hi"}]


In [12]:
output[0]["generated_text"][0]['content']

'how to spend the weekend?'

In [13]:
print(output[0]["generated_text"][1]['content'])

 Spending a weekend can be a great opportunity to relax, recharge, and engage in activities that you enjoy. Here are some suggestions for how to spend your weekend:

1. Plan ahead: Decide on a general theme or type of activity you'd like to do. This could be anything from outdoor adventures, cultural experiences, or simply relaxing at home.

2. Outdoor activities: If the weather permits, consider going for a hi
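
The indexing used in the last two cells can be wrapped in a small helper. This is a sketch: `last_assistant_reply` is a hypothetical name, and the sample data mirrors the structure the pipeline printed for the chicken joke.

```python
def last_assistant_reply(output):
    """Return the content of the final assistant message in a chat pipeline result."""
    conversation = output[0]["generated_text"]
    # Walk backwards so the most recent assistant turn is found first.
    for message in reversed(conversation):
        if message["role"] == "assistant":
            return message["content"].strip()  # drop the leading space
    return None


sample_output = [{
    "generated_text": [
        {"role": "user", "content": "Create a funny joke about chickens."},
        {"role": "assistant",
         "content": " Why did the chicken join the band? Because it had the drumsticks!"},
    ]
}]
print(last_assistant_reply(sample_output))
```

Looking up the message by role rather than by position keeps the helper working if a system message or earlier turns are added to the conversation.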


Key Takeaways

The Hugging Face Hub is a central repository for finding and downloading LLMs and other AI models.

The Transformers library simplifies the process of loading and using LLMs, with utilities like pipeline for text generation.

The Phi-3-mini model is a compact generative model (about 3.8 billion parameters) that can run on devices with limited resources.

Text generation involves two main components: the model and the tokenizer. The tokenizer breaks input text into tokens, the model generates new tokens based on them, and the tokenizer converts those new tokens back into text.

The example demonstrates how to generate text using a simple prompt, resulting in a humorous output.

