# Introduction to Hugging Face: Understanding Model Loading and Inference

<img src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo-with-title.png" alt="Hugging Face logo" style="width: 400px;"/>


Welcome to our workshop on the fundamentals of Hugging Face, a pivotal platform in the world of Natural Language Processing (NLP) and Machine Learning (ML). This session is designed to guide you through the basic mechanics of Hugging Face, particularly focusing on loading models and performing inference. We will explore both direct methods using the `generate` function from loaded models and the streamlined approach offered by Hugging Face pipelines.

## What is Hugging Face?

Hugging Face is a revolutionary platform in the AI community, best known for its Transformers library. It provides an extensive collection of pre-trained models that are crucial for a variety of NLP tasks such as text classification, question answering, text generation, and more. These models, built on the transformer architecture, have transformed the way we approach language understanding and generation in AI.

## The Value of Hugging Face

The true value of Hugging Face lies in its accessibility and ease of use. With just a few lines of code, developers can leverage complex, state-of-the-art models for their NLP needs. This ease of access democratizes cutting-edge AI technology, making it available to a broader range of users and developers. Whether it's fine-tuning models on specific datasets or quickly deploying pre-trained models for inference, Hugging Face simplifies these processes remarkably.

## Why Learn Hugging Face?

For developers venturing into NLP and ML, Hugging Face is an indispensable tool. It not only accelerates the development process but also provides a platform for experimentation and learning. Understanding how to effectively load models and perform inference is crucial for building robust AI applications. As the field of NLP continues to evolve, proficiency in Hugging Face will become increasingly valuable.

## Foundation for the Workshop

The skills and concepts we develop here form the foundation for more advanced topics in our workshop. By mastering model loading and inference, you'll be well-prepared to tackle more complex tasks, experiment with different models, and appreciate the nuances of model performance and adaptation. Let's embark on this journey to unlock the full potential of Hugging Face in AI development.


#### Environment Setup and Dependency Installation

This cell sets up the environment and installs the dependencies. It sources the Intel oneAPI environment script (only needed on Intel Developer Cloud), installs `transformers` from the main branch of its GitHub repository, and pins `torch` to version 2.1.0. Because `transformers` is installed from the main branch rather than a tagged release, the exact version you get depends on when the cell is run.
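Since the `transformers` install is not pinned, it can help to record which versions are actually present, either before or after running the install cell. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the workshop code.

```python
from importlib import metadata


def installed_versions(packages=("transformers", "torch")):
    """Return {package: version or None} without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Reading the version from package metadata avoids importing `torch`, which can be slow, just to check what is installed.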

In [1]:
!source /opt/intel/oneapi/setvars.sh #comment out if not running on Intel Developer Cloud Jupyter
!pip install git+https://github.com/huggingface/transformers
!pip install torch==2.1.0

 

#### Model Loading
Here, we load the model and tokenizer from the Hugging Face Hub. We use `AutoModelForCausalLM` and `AutoTokenizer` from the `transformers` library to load "stabilityai/stablelm-2-1_6b", a causal language model. Setting `trust_remote_code=True` allows `transformers` to download and run custom Python code shipped in the model repository, so enable it only for repositories you trust. Setting `torch_dtype="auto"` loads the weights in the precision recorded in the checkpoint instead of defaulting to 32-bit floats, which typically reduces memory use.


In [2]:
# Load model directly
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-2-1_6b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
  "stabilityai/stablelm-2-1_6b",
  trust_remote_code=True,
  torch_dtype="auto",
)
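To see why the precision chosen by `torch_dtype` matters, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. The parameter count of 1.6 billion is an assumption read off the "1_6b" in the model name, and the helper `weight_memory_gib` is hypothetical; activations and other runtime overhead are not included.

```python
# Bytes used per weight for common floating-point precisions.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_parameters, dtype):
    """Approximate memory for the weights only, in GiB."""
    return num_parameters * BYTES_PER_ELEMENT[dtype] / 1024**3


# Assumption: roughly 1.6 billion parameters, inferred from the model name.
ASSUMED_PARAMETERS = 1.6e9

for dtype in BYTES_PER_ELEMENT:
    print(f"{dtype}: ~{weight_memory_gib(ASSUMED_PARAMETERS, dtype):.1f} GiB")
```

Under this assumption the weights take about 6 GiB in 32-bit precision and about 3 GiB in a 16-bit format. With the model loaded, the real count is available as `sum(p.numel() for p in model.parameters())`, and `model.dtype` reports the precision that was selected.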

#### Generating Text
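In this cell, we generate text. We first tokenize a prompt ("I love learning to code...") and move the resulting tensors to the model's device, then call the model's `generate` method. Setting `do_sample=True` switches from greedy decoding to sampling, which is what makes `temperature` and `top_p` take effect: a `temperature` below 1 sharpens the next-token distribution, and `top_p` restricts sampling to the most probable tokens whose cumulative probability reaches the threshold. `max_new_tokens` caps how many tokens are generated beyond the prompt. The result is then decoded and printed, showing the model's continuation of our prompt.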


In [3]:
inputs = tokenizer("I love learning to code...", return_tensors="pt").to(model.device)
tokens = model.generate(
  **inputs,
  max_new_tokens=64,
  temperature=0.70,
  top_p=0.95,
  do_sample=True,
)
print(tokenizer.decode(tokens[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:100257 for open-end generation.


I love learning to code... and I love being able to teach others how to code. I've been teaching people how to code for almost 2 years now. The best part is that it's completely free and you don't need to have any previous programming experience to start learning.
If you're interested in learning to code, I've created a
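To build intuition for what `temperature` and `top_p` do to a single sampling step, here is a simplified sketch in plain Python. The four-word vocabulary and its logits are made up for illustration, and the functions are a conceptual approximation, not the code `transformers` runs internally.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_total = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / kept_total for i in kept}


# Hypothetical toy vocabulary and logits, for illustration only.
toy_vocab = ["code", "cook", "swim", "paint"]
toy_logits = [3.0, 2.0, 0.5, -1.0]

probs = softmax_with_temperature(toy_logits, temperature=0.7)
nucleus = top_p_filter(probs, top_p=0.95)

rng = random.Random(0)  # seeded so the sketch is repeatable
choice = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print("candidates:", [toy_vocab[i] for i in nucleus])
print("sampled:", toy_vocab[choice])
```

With these toy numbers only the two most likely words survive the `top_p` cut, which is how the setting trims unlikely continuations while still leaving room for variety.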


#### Setting Up a Pipeline for Text Generation
This cell sets up a Hugging Face `pipeline` for text generation, reusing the model and tokenizer loaded earlier. The pipeline wraps tokenization, generation, and decoding in a single call, which gives a more streamlined route to inference. `torch_dtype` and `device_map` are load-time options; because the model is passed here as an already-loaded object, they may have no effect and matter mainly when `model` is given as a repository name. Note also that `max_length=64` counts the prompt tokens as well as the generated ones, unlike `max_new_tokens` in the previous cell.


In [4]:
from transformers import pipeline
import torch

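# Build a text-generation pipeline around the model and tokenizer loaded earlier.
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    # Load-time options; these may have no effect on an already-loaded model.
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Generate one sampled continuation of the prompt.
sequences = generator(
    "I love learning to code...",
    max_length=64,  # total length, prompt tokens included
    do_sample=True,
    top_k=3,
    temperature=0.70,
    top_p=0.95,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)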

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:100257 for open-end generation.
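The pipeline call uses `max_length`, while the earlier `generate` call used `max_new_tokens`. For a decoder-only model these are different budgets, and the sketch below spells out the arithmetic. The helper `new_token_budget` and the prompt length of 6 tokens are hypothetical; the real prompt length can be measured with `len(tokenizer(prompt)["input_ids"])`.

```python
def new_token_budget(prompt_tokens, max_length=None, max_new_tokens=None):
    """Estimate how many tokens generation may add for a decoder-only model."""
    if max_new_tokens is not None:
        # When both limits are given, max_new_tokens takes precedence.
        return max_new_tokens
    if max_length is None:
        raise ValueError("provide max_length or max_new_tokens")
    # max_length covers prompt + generated tokens, so the prompt eats into it.
    return max(max_length - prompt_tokens, 0)


# Assumption for illustration; measure the real value with the tokenizer.
HYPOTHETICAL_PROMPT_TOKENS = 6

print(new_token_budget(HYPOTHETICAL_PROMPT_TOKENS, max_length=64))      # 58
print(new_token_budget(HYPOTHETICAL_PROMPT_TOKENS, max_new_tokens=64))  # 64
```

With a short prompt the difference is small, but with a long prompt `max_length` can leave little or no room for new text, which is why `max_new_tokens` is often the easier limit to reason about.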


#### Displaying Pipeline Results
In the final cell, we iterate over the sequences returned by the pipeline and print each one. Each item is a dictionary whose `generated_text` entry holds the prompt followed by the model's continuation, so printing it lets us judge whether the output is coherent and relevant to the prompt.


In [6]:
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: I love learning to code... but I'm not a programmer. I'm not a designer. I'm not a developer. I'm not a business owner. I'm not a marketer. I'm not a salesperson. I'm not a writer. I'm not a manager.
I'm a marketer. I'm
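As the result shows, `generated_text` begins with the prompt itself. If only the continuation is wanted, one option is to strip the prompt afterwards. This is a minimal sketch: the helper `strip_prompt` is hypothetical, and the sample data only mimics the list-of-dictionaries shape printed above.

```python
def strip_prompt(prompt, sequences):
    """Return only the continuation from each pipeline result."""
    completions = []
    for seq in sequences:
        text = seq["generated_text"]
        if text.startswith(prompt):
            text = text[len(prompt):]  # drop the echoed prompt
        completions.append(text.strip())
    return completions


prompt = "I love learning to code..."
# Sample data shaped like the pipeline output; not produced by a live model call.
sample_sequences = [
    {"generated_text": "I love learning to code... but I'm not a programmer."}
]

for completion in strip_prompt(prompt, sample_sequences):
    print(f"Completion: {completion}")
```

The text-generation pipeline also accepts `return_full_text=False`, which should give a similar result without post-processing; the manual version is shown here because it makes the output structure explicit.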


# Conclusion and Discussion

#### Conclusion

Throughout this workshop, we've explored the basic mechanics of using Hugging Face, focusing on loading models, performing inference with `generate`, and using pipelines for more streamlined text generation. We ran these steps with Hugging Face's Transformers library in an environment prepared for Intel hardware.

#### Discussion

The exercises demonstrated how little code is needed to use pre-trained models from Hugging Face, and how choices such as precision and generation parameters shape both resource use and output. Developers should now have a foundational understanding of model loading, text generation, and pipelines in Hugging Face, which are central to building NLP applications. This knowledge serves as a stepping stone to more advanced topics in AI and ML.