### Learning Transformers

- Reference: https://huggingface.co/docs/transformers/quicktour?installation=PyTorch

- Transformers is designed to be fast and easy to use so that everyone can start learning or building with transformer models.

    - The number of user-facing abstractions is limited to THREE classes for instantiating a model and TWO APIs for inference or training. This quickstart introduces Transformers’ key features and shows you how to:

        - load a pretrained model
        - run inference with Pipeline
        - fine-tune a model with Trainer

## PRETRAINED MODELS

- Each pretrained model inherits from three base classes.

#### PretrainedConfig
    - A configuration class, saved as a config.json file, that specifies a model's attributes such as the number of attention heads or the vocabulary size.
#### PreTrainedModel
    - A model (or architecture) defined by the attributes in the configuration. A base pretrained model only returns the raw hidden states. For a specific task, use the appropriate model head to convert the raw hidden states into a meaningful result (for example, LlamaModel versus LlamaForCausalLM).
#### Preprocessor
    - A class for converting raw inputs (text, images, audio, multimodal) into numerical inputs to the model. For example, PreTrainedTokenizer converts text into tensors and ImageProcessingMixin converts pixels into tensors.
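To see the difference between a base model and a model with a task head without downloading anything, the sketch below builds a deliberately tiny, randomly initialised Llama configuration (all sizes are arbitrary assumptions chosen for speed, not Llama 3.1's real values) and compares the output shapes of `LlamaModel` and `LlamaForCausalLM`.

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM, LlamaModel

# PretrainedConfig subclass: a tiny, made-up configuration (random weights, no download)
config = LlamaConfig(
    vocab_size=128,
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=4,
)

# PreTrainedModel subclasses: the bare architecture, and the same architecture
# with a language-modelling head on top
base_model = LlamaModel(config).eval()
lm_model = LlamaForCausalLM(config).eval()

input_ids = torch.randint(0, config.vocab_size, (1, 5))  # one sequence of 5 token ids
with torch.no_grad():
    hidden_states = base_model(input_ids=input_ids).last_hidden_state  # raw hidden states
    logits = lm_model(input_ids=input_ids).logits  # one score per vocabulary entry

print(hidden_states.shape)  # torch.Size([1, 5, 32]): (batch, sequence, hidden_size)
print(logits.shape)         # torch.Size([1, 5, 128]): (batch, sequence, vocab_size)
```

The head is what turns hidden states of size `hidden_size` into scores over the vocabulary, which is what text generation needs.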

- Use the AutoClass API to load models and preprocessors because it automatically infers the appropriate architecture for each task and machine learning framework based on the name or path to the pretrained weights and configuration file.

- Use from_pretrained() to load the weights and configuration file from the Hub (or from a local directory, as in the cells below) into the model and preprocessor classes.
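The AutoClass inference can be observed with a local directory, similar to `./SavedModels/llama3.1-Instruct`. This sketch (using a tiny, made-up Llama configuration as an assumption) saves a model with `save_pretrained()` and reloads it through `AutoConfig` and `AutoModelForCausalLM`, which only receive a path and read `config.json` to pick the concrete classes.

```python
import tempfile

from transformers import AutoConfig, AutoModelForCausalLM, LlamaConfig, LlamaForCausalLM

# Tiny, made-up configuration so the example runs quickly without a download
config = LlamaConfig(vocab_size=128, hidden_size=32, intermediate_size=64,
                     num_hidden_layers=2, num_attention_heads=4, num_key_value_heads=4)

with tempfile.TemporaryDirectory() as model_dir:
    # Writes config.json (which records model_type="llama") and the weights
    LlamaForCausalLM(config).save_pretrained(model_dir)

    # The Auto classes get only a path; they infer the concrete classes from config.json
    auto_config = AutoConfig.from_pretrained(model_dir)
    auto_model = AutoModelForCausalLM.from_pretrained(model_dir)

print(type(auto_config).__name__, type(auto_model).__name__)  # LlamaConfig LlamaForCausalLM
```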

### PyTorch

- When you load a model, configure the following parameters to ensure the model is optimally loaded.

    - device_map="auto" automatically allocates the model weights to your fastest device first, which is typically the GPU.
    - torch_dtype="auto" directly initializes the model weights in the data type they’re stored in, which can help avoid loading the weights twice (PyTorch loads weights in torch.float32 by default).
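The dtype matters because the memory needed for the weights is roughly the parameter count times the bytes per parameter. The arithmetic below assumes an 8-billion-parameter model (an assumption: the notebook does not state the model size) and ignores activations and the KV cache, so real usage is higher.

```python
# Bytes needed to store one parameter in each dtype
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB (no activations or KV cache)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


NUM_PARAMS = 8_000_000_000  # assumption: an ~8B-parameter model

for dtype in ("float32", "float16", "bfloat16"):
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
# float32: 29.8 GiB
# float16: 14.9 GiB
# bfloat16: 14.9 GiB
```

Under this assumption, even the half-precision weights are larger than an 8 GB GPU, which would be consistent with the VRAM observations recorded in the cells below.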

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model in float16 and place every layer on the first GPU
model = AutoModelForCausalLM.from_pretrained("./SavedModels/llama3.1-Instruct", torch_dtype=torch.float16, device_map={"": "cuda:0"})
# Load the matching tokenizer from the same directory
tokenizer = AutoTokenizer.from_pretrained("./SavedModels/llama3.1-Instruct")

Loading checkpoint shards: 100%|██████████| 4/4 [00:19<00:00,  4.96s/it]


- Tokenize the text with the tokenizer and return PyTorch tensors (`return_tensors="pt"`). Move the inputs to the same device as the model (here the GPU) so that inference runs there.
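The tokenizer call returns a `BatchEncoding` holding `input_ids` and an `attention_mask`, and `.to()` moves every tensor in it at once. Because the Llama tokenizer needs its downloaded files, this sketch builds a hypothetical word-level tokenizer with a made-up five-word vocabulary using the `tokenizers` library; only the call pattern carries over to the real tokenizer.

```python
import torch
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Hypothetical toy vocabulary; unknown words map to [UNK]
vocab = {"[UNK]": 0, "[PAD]": 1, "the": 2, "secret": 3, "is": 4}
backend = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()  # split on whitespace and punctuation
tokenizer = PreTrainedTokenizerFast(tokenizer_object=backend, unk_token="[UNK]", pad_token="[PAD]")

# Two prompts of different lengths: padding=True pads the shorter one on the right
batch = tokenizer(["the secret is", "the robot"], padding=True, return_tensors="pt")

# Move every tensor to the GPU when one is available, otherwise stay on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
batch = batch.to(device)

print(batch["input_ids"].tolist())       # [[2, 3, 4], [2, 0, 1]]
print(batch["attention_mask"].tolist())  # [[1, 1, 1], [1, 1, 0]]
```

The attention mask marks the padding position with 0 so the model ignores it.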

In [3]:
model_inputs = tokenizer(["The secret to become successful in Robotics industry is "], return_tensors="pt").to("cuda")


- The model is now ready for inference or training.

- For inference, pass the tokenized inputs to generate() to generate text. Decode the token ids back into text with batch_decode().
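In `generate()`, `max_length` counts the prompt tokens as well as the new ones, whereas `max_new_tokens` counts only the continuation; a cell with `max_length=10` therefore returns at most 10 tokens including the prompt. This sketch checks that on a tiny, randomly initialised Llama (a made-up configuration, so the generated ids are meaningless); `min_length` and `min_new_tokens` are added only to stop the random model from ending early on the end-of-sequence token.

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

torch.manual_seed(0)
# Tiny, made-up configuration with random weights (no download needed)
config = LlamaConfig(vocab_size=128, hidden_size=32, intermediate_size=64,
                     num_hidden_layers=2, num_attention_heads=4, num_key_value_heads=4)
model = LlamaForCausalLM(config).eval()

# A fake 5-token prompt (ids >= 3 avoid the default special-token ids)
prompt_ids = torch.randint(3, config.vocab_size, (1, 5))
attention_mask = torch.ones_like(prompt_ids)

# max_length is the TOTAL length: 5 prompt tokens + 3 new tokens = 8
by_total = model.generate(input_ids=prompt_ids, attention_mask=attention_mask,
                          max_length=8, min_length=8, do_sample=False)
# max_new_tokens counts only the generated continuation
by_new = model.generate(input_ids=prompt_ids, attention_mask=attention_mask,
                        max_new_tokens=3, min_new_tokens=3, do_sample=False)

print(by_total.shape, by_new.shape)  # torch.Size([1, 8]) torch.Size([1, 8])
```

For a decoder-only model, the returned ids start with the prompt, which is why the decoded text below begins with `<|begin_of_text|>` and the original prompt.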

In [None]:
generated_ids = model.generate(**model_inputs, max_length=150)
tokenizer.batch_decode(generated_ids)[0]
# Took 4 min 19 s to generate with max_length=150 on an RTX 4070 laptop GPU (8 GB VRAM).
# Just loading the model onto cuda:0 used up all of the VRAM.

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


'<|begin_of_text|>The secret to become successful in Robotics industry is 3 things. This is the opinion of many professionals in the field. They are:\n1. Passion\n2. Hard work\n3. Networking\nThese are the three key elements that will help you become successful in the Robotics industry. Passion will drive you to learn and improve continuously. Hard work will help you to apply your knowledge and skills to achieve your goals. Networking will help you to find opportunities, get advice and support from experienced professionals.\nThe Robotics industry is a highly competitive field, and it requires a lot of dedication and perseverance to succeed. However, with the right mindset and approach, it is possible to achieve great things and make a meaningful impact in this field.\nHere are some additional'

In [5]:
print(tokenizer.batch_decode(generated_ids)[0])

<|begin_of_text|>The secret to become successful in Robotics industry is 3 things. This is the opinion of many professionals in the field. They are:
1. Passion
2. Hard work
3. Networking
These are the three key elements that will help you become successful in the Robotics industry. Passion will drive you to learn and improve continuously. Hard work will help you to apply your knowledge and skills to achieve your goals. Networking will help you to find opportunities, get advice and support from experienced professionals.
The Robotics industry is a highly competitive field, and it requires a lot of dedication and perseverance to succeed. However, with the right mindset and approach, it is possible to achieve great things and make a meaningful impact in this field.
Here are some additional


In [None]:
model_inputs = tokenizer(["Can you elaborate on those 3 things? "], return_tensors="pt").to("cuda")
generated_ids = model.generate(**model_inputs, max_length=200)
print(tokenizer.batch_decode(generated_ids)[0])
# Took 6 min 19 s to generate with max_length=200 on an RTX 4070 laptop GPU (8 GB VRAM).
# Just loading the model onto cuda:0 used up all of the VRAM.
# generate() keeps no memory of earlier calls: the previous prompt and answer are not part
# of this input, so the model starts from scratch and the answer below ignores them.

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>Can you elaborate on those 3 things?  I'm not sure I fully understand what you're saying.
1. "You're trying to find the most efficient way to get from point A to point B."  I think I understand this part.  I'm trying to find the most efficient way to get from my current understanding of the universe to a deeper understanding of the universe.
2. "You're trying to find a new way to see the universe."  I'm not sure what this means.  Are you saying that I'm trying to find a new perspective or a new way of understanding the universe that isn't based on my current knowledge or experience?
3. "You're trying to find a new way to describe the universe."  I think I understand this part.  I'm trying to find new words or concepts to describe the universe, or to describe the things that I experience in the universe.

I think I understand what you're saying, but I'd like


In [None]:
model_inputs = tokenizer(["what is your name?"], return_tensors="pt").to("cuda")
generated_ids = model.generate(**model_inputs, max_length=10)
print(tokenizer.batch_decode(generated_ids)[0])
# Took 7 s to generate with max_length=10 on an RTX 4070 laptop GPU (8 GB VRAM).
# Just loading the model onto cuda:0 used up all of the VRAM.
# Again, no earlier prompts were cached: every call starts from scratch.

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>what is your name? i am a student



In [None]:
model_inputs = tokenizer(["The secret of staying happy is"], return_tensors="pt").to("cuda")
generated_ids = model.generate(**model_inputs, max_length=20)
print(tokenizer.batch_decode(generated_ids)[0])
# Took 25 s to generate with max_length=20 on an RTX 4070 laptop GPU (8 GB VRAM).
# Generation time grows with the number of tokens generated, because each new token
# needs another forward pass through the model.

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>The secret of staying happy is to find the joy in everyday moments. It’s a mindset,
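To check how generation time scales with output length on a given machine, one option is to time `generate()` for a few `max_new_tokens` values and compute tokens per second. The sketch below does this with `time.perf_counter()` on a tiny, randomly initialised Llama (made-up sizes, so the absolute numbers say nothing about Llama 3.1 on an RTX 4070); pass the real `model` and tokenized prompt to measure that instead.

```python
import time

import torch
from transformers import LlamaConfig, LlamaForCausalLM


def timed_generate(model, input_ids, new_tokens):
    """Generate exactly `new_tokens` tokens greedily; return (tokens generated, seconds)."""
    attention_mask = torch.ones_like(input_ids)
    start = time.perf_counter()
    with torch.no_grad():
        output = model.generate(input_ids=input_ids, attention_mask=attention_mask,
                                max_new_tokens=new_tokens, min_new_tokens=new_tokens,
                                do_sample=False)
    if output.is_cuda:
        torch.cuda.synchronize()  # wait for queued GPU work before reading the clock
    elapsed = time.perf_counter() - start
    return output.shape[1] - input_ids.shape[1], elapsed


# Tiny, made-up model so the sketch runs anywhere
config = LlamaConfig(vocab_size=128, hidden_size=32, intermediate_size=64,
                     num_hidden_layers=2, num_attention_heads=4, num_key_value_heads=4)
model = LlamaForCausalLM(config).eval()
prompt_ids = torch.randint(3, config.vocab_size, (1, 8))

for n in (10, 20, 40):
    generated, seconds = timed_generate(model, prompt_ids, n)
    print(f"{generated} new tokens in {seconds:.3f} s ({generated / seconds:.1f} tokens/s)")
```

Pinning both `min_new_tokens` and `max_new_tokens` keeps every run the same length, so the timings are comparable.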



## Pipeline
- The Pipeline class is the most convenient way to run inference with a pretrained model. It supports many tasks such as text generation, image segmentation, automatic speech recognition, document question answering, and more.
- Create a Pipeline object and select a task. By default, Pipeline downloads and caches a default pretrained model for a given task. Pass a model name (or a local path, as below) to the `model` parameter to choose a specific model.
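To control output length in a text-generation pipeline, pass `max_new_tokens` rather than `max_length`: the warnings in the cells below show a `max_new_tokens` of 256 taking precedence over `max_length`. This sketch runs the pipeline on an already-instantiated model and tokenizer, both hypothetical (a toy eight-word vocabulary and a tiny random Llama), so the generated words are random; `return_full_text=False` returns only the continuation.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import LlamaConfig, LlamaForCausalLM, PreTrainedTokenizerFast, pipeline

# Hypothetical toy vocabulary and tokenizer
vocab = {"[UNK]": 0, "[PAD]": 1, "[EOS]": 2, "the": 3, "secret": 4, "is": 5, "robots": 6, "learn": 7}
backend = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()
tokenizer = PreTrainedTokenizerFast(tokenizer_object=backend, unk_token="[UNK]",
                                    pad_token="[PAD]", eos_token="[EOS]")

# Tiny random model whose vocabulary matches the toy tokenizer
config = LlamaConfig(vocab_size=len(vocab), hidden_size=32, intermediate_size=64,
                     num_hidden_layers=2, num_attention_heads=4, num_key_value_heads=4,
                     pad_token_id=1, bos_token_id=None, eos_token_id=2)
model = LlamaForCausalLM(config).eval()

# A distinct variable name keeps the pipeline() factory function usable
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = generator("the secret is", max_new_tokens=5, min_new_tokens=5,
                   do_sample=False, return_full_text=False)
print(result[0]["generated_text"])
```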

In [None]:
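from transformers import pipeline

# Use a new name so the imported pipeline() function is not shadowed
generator = pipeline("text-generation", model="./SavedModels/llama3.1-Instruct", device="cuda")
# A default max_new_tokens (256, per the warning below) takes precedence over max_length,
# and truncation only affects how the prompt is tokenized, not the length of the output.
# Pass max_new_tokens to limit the generated text.
generator("The secret to get hired in FAANG companies is", max_length=10, truncation=True)

# Took 9 min 3 s with max_new_tokens=256 in effect on an RTX 4070 laptop GPU (8 GB VRAM).
# Generation time grows with the number of tokens generated.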

Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00,  9.29it/s]
Device set to use cuda
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "The secret to get hired in FAANG companies is not a secret. However, the path to get hired in these companies is often misunderstood. Here's what it takes to get hired in FAANG companies.\nThe FAANG companies (Facebook, Apple, Amazon, Netflix, and Google) are known for their rigorous hiring processes, and getting hired in these companies is often a challenging and competitive process. However, the secret to getting hired in these companies is not a secret. Here's what it takes to get hired in FAANG companies:\n1. **Develop in-demand skills**: FAANG companies are constantly looking for professionals with in-demand skills, such as machine learning, artificial intelligence, cloud computing, data science, cybersecurity, and software engineering. Stay up-to-date with industry trends and develop skills that are in high demand.\n2. **Gain relevant experience**: FAANG companies prefer candidates with relevant work experience, especially in the tech industry. Internships, c

In [None]:
from transformers import pipeline

# Same pipeline, but running entirely on the CPU
generator = pipeline("text-generation", model="./SavedModels/llama3.1-Instruct", device="cpu")
# max_length is again overridden by the default max_new_tokens=256 (see the warning below)
generator("The secret is", max_length=10)

# Took 2 min 43 s with max_new_tokens=256 in effect on a laptop with 32 GB of RAM, running on the CPU.
# Generation time grows with the number of tokens generated.

Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 111.85it/s]
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=10) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "The secret is out: the best way to get a good night's sleep is not to try to get a good night's sleep.\nIn other words, the more you worry about getting a good night's sleep, the less likely you are to actually get one. This is because sleep is a complex process that involves the interplay of multiple factors, including your brain's ability to regulate its own sleep-wake cycle, your physical environment, and your mental state.\nWhen you try to force yourself to sleep, you can create a kind of self-fulfilling prophecy that makes it harder to fall asleep. This is because the more you focus on falling asleep, the more anxious and alert you become, which can actually interfere with your ability to relax and fall asleep.\nSo, instead of trying to get a good night's sleep, try this: stop trying to get a good night's sleep. Yes, you read that right. Just let go of the expectation that you need to sleep well, and instead focus on creating a relaxing and calming environment