### Importing the required libraries

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import json
import torch
import transformers



### Load the `config.json` file that stores the Hugging Face token

In [2]:
with open('config.json') as f:
    config_data = json.load(f)
HF_TOKEN = config_data['HF_TOKEN']
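
A slightly more defensive way to read the token is sketched below. It closes the file after reading and falls back to an environment variable when the file or key is missing. The `HF_TOKEN` key matches the notebook; the function name `load_hf_token` and the environment-variable fallback are assumptions of this sketch.

```python
import json
import os
from pathlib import Path


def load_hf_token(config_path="config.json", env_var="HF_TOKEN"):
    """Return the token from a JSON file ({"HF_TOKEN": "..."}), else from an env var.

    Hypothetical helper: the env-var fallback is an assumption, not part of the notebook.
    """
    path = Path(config_path)
    if path.is_file():
        # The `with` block guarantees the file handle is closed
        with path.open(encoding="utf-8") as f:
            token = json.load(f).get("HF_TOKEN")
        if token:
            return token
    token = os.environ.get(env_var)
    if token:
        return token
    raise RuntimeError(f"No token found in {config_path} or in ${env_var}")
```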

### Indicate the model you want to use
In this case, we are using Meta's Llama 3 model, which you can find on the Hugging Face model hub as [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). Access to this repository is gated, which is why the Hugging Face token is needed.

In [3]:
model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'

### Now we need to configure the quantization settings

In [4]:
# Optional 4-bit quantization: NF4 weights, double quantization of the scaling constants, bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16
)
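
`load_in_4bit=True` stores each weight in 4 bits instead of 16, so the weights take roughly a quarter of the bfloat16 memory, plus a small overhead for per-block scaling constants (which `bnb_4bit_use_double_quant=True` shrinks further by quantizing them as well). The sketch below shows the basic idea with simple blockwise absmax quantization to evenly spaced integers in the symmetric range [-7, 7]. It is a simplification: the `nf4` type used by bitsandbytes maps values to a fixed set of 16 levels derived from a normal distribution rather than evenly spaced ones, and the block sizes here are assumptions for illustration.

```python
def quantize_block(values, levels=7):
    """Map floats to signed ints in [-levels, levels] using the block's absolute maximum."""
    absmax = max(abs(v) for v in values)
    scale = absmax / levels if absmax > 0 else 1.0  # avoid dividing by zero for all-zero blocks
    codes = [max(-levels, min(levels, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes and the block scale."""
    return [c * scale for c in codes]


def quantize_blockwise(weights, block_size=64):
    """Quantize a flat list of weights in independent blocks, each with its own scale."""
    return [quantize_block(weights[i:i + block_size]) for i in range(0, len(weights), block_size)]


weights = [0.12, -0.53, 0.27, 0.71, -0.70, 0.0, 0.05, -0.33]
for codes, scale in quantize_blockwise(weights, block_size=4):
    print(codes, [round(x, 3) for x in dequantize_block(codes, scale)])
```

Each block keeps its own scale, so one large weight only degrades the precision of its own block rather than the whole tensor.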

### Creating tokenizer and model

In the context of language models like Llama 3, `tokenizer` is an instance of a tokenizer class that converts text into the token IDs the model understands, and converts generated IDs back into text.

`tokenizer.pad_token` and `tokenizer.eos_token` are special tokens used in the tokenization process:

- `pad_token`: This is the token used for padding. Padding is necessary because when processing multiple sequences in a batch, all sequences need to be the same length. If a sequence is shorter than others, it's filled (or "padded") with this token to match the length of the longest sequence.

- `eos_token`: This is the "end of sequence" token. It's used to signal the end of a sequence. This is particularly important in tasks like language generation where the model needs to know when to stop generating text.

The line `tokenizer.pad_token = tokenizer.eos_token` sets the padding token to be the same as the end-of-sequence token. Llama 3's tokenizer does not define a dedicated padding token by default, so reusing the EOS token is a common workaround that avoids adding a new token (with an untrained embedding) to the vocabulary. Padded positions are excluded through the attention mask, so reusing the EOS ID for them does not change what the model attends to.
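
The effect of padding can be sketched in plain Python: shorter sequences are filled with the pad ID, and an attention mask marks which positions hold real tokens so the model ignores the padded ones. The token IDs below are made up; in the notebook the pad ID would come from `tokenizer.pad_token_id`. Decoder-only models such as Llama 3 are usually padded on the left for batched generation, so that newly generated tokens follow the real tokens directly.

```python
def pad_batch(sequences, pad_id, side="right"):
    """Pad token-ID lists to a common length and build the matching attention masks."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_real = [1] * len(seq)       # 1 = real token, attended to
        mask_pad = [0] * len(padding)    # 0 = padding, ignored
        if side == "left":
            input_ids.append(padding + list(seq))
            attention_mask.append(mask_pad + mask_real)
        else:
            input_ids.append(list(seq) + padding)
            attention_mask.append(mask_real + mask_pad)
    return input_ids, attention_mask


PAD_ID = 0  # hypothetical; with pad_token = eos_token this would be tokenizer.eos_token_id
batch = [[15, 27, 9], [42]]
print(pad_batch(batch, PAD_ID))               # → ([[15, 27, 9], [42, 0, 0]], [[1, 1, 1], [1, 0, 0]])
print(pad_batch(batch, PAD_ID, side="left"))  # → ([[15, 27, 9], [0, 0, 42]], [[1, 1, 1], [0, 0, 1]])
```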

In [5]:
# Load the tokenizer (quantization applies only to the model weights, not to the tokenizer)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=HF_TOKEN)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
#tokenizer.pad_token = tokenizer.eos_token

#### Load the pre-trained model (this takes a while)

In [6]:
#model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto', token=HF_TOKEN)
# If you want to use quantization, use the following line instead
model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto', quantization_config=bnb_config, token=HF_TOKEN)

Loading checkpoint shards: 100%|██████████| 4/4 [02:45<00:00, 41.39s/it]


#### Save the quantized model and the tokenizer to local directories

In [8]:
# Save the model
model.save_pretrained('LLAMA-3-8B-Instruct-Quant')

In [9]:
# Save the tokenizer
tokenizer.save_pretrained('LLAMA-3-8B-Instruct-tokenizer-Quant')

('LLAMA-3-8B-Instruct-tokenizer-Quant/tokenizer_config.json',
 'LLAMA-3-8B-Instruct-tokenizer-Quant/special_tokens_map.json',
 'LLAMA-3-8B-Instruct-tokenizer-Quant/tokenizer.json')

### Loading the model
The model and its tokenizer are now stored locally in the `LLAMA-3-8B-Instruct-Quant` and `LLAMA-3-8B-Instruct-tokenizer-Quant` directories. The cells below load them back, for example in a fresh session.

In [10]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import json
import torch
import transformers

In [11]:
with open('config.json') as f:
    config_data = json.load(f)
HF_TOKEN = config_data['HF_TOKEN']

In [12]:
# Free memory left over from earlier cells (Python garbage and PyTorch's cached GPU memory)
import gc
gc.collect()
torch.cuda.empty_cache()

In [15]:
model_id = 'LLAMA-3-8B-Instruct-Quant'
tokenizer_id = 'LLAMA-3-8B-Instruct-tokenizer-Quant'

In [16]:
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id, token=HF_TOKEN)
tokenizer.pad_token = tokenizer.eos_token

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [17]:
# Load the saved model; its quantization settings were stored in its config.json, so no BitsAndBytesConfig is passed here
model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto', token=HF_TOKEN)

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.20it/s]


In [60]:
# Define the pipeline for text generation
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

sys_content = "You are a an assitant"
user_content = None

messages = [
    {"role": "system", "content": sys_content},
    {"role": "user", "content": user_content},
]
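
`terminators` lists more than one stop ID because Llama 3 Instruct ends each assistant turn with `<|eot_id|>`, which, depending on the tokenizer version, is not necessarily the tokenizer's `eos_token`. Generation stops at whichever appears first. The loop below sketches that behaviour with a stand-in `next_token` function; it is not how `transformers` implements generation internally, the token IDs are made up, and dropping the stop token from the result is a choice of this sketch.

```python
def generate_until_stop(next_token, prompt_ids, stop_ids, max_new_tokens=256):
    """Append tokens from `next_token` until one of `stop_ids` appears or the budget runs out."""
    stop = set(stop_ids)
    ids = list(prompt_ids)
    new_tokens = []
    for _ in range(max_new_tokens):
        token = next_token(ids)  # stand-in for one model forward pass plus sampling
        if token in stop:
            break  # this sketch does not keep the stop token itself
        ids.append(token)
        new_tokens.append(token)
    return new_tokens


# Hypothetical IDs: 900 plays the role of eos_token_id, 901 the role of <|eot_id|>
scripted = iter([11, 12, 13, 901, 14])
print(generate_until_stop(lambda ids: next(scripted), [1, 2], [900, 901]))  # → [11, 12, 13]
```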


#### Now you can use the pipeline to generate text

In [61]:
user_content = "What is the weather like today?"

messages = [
    {"role": "system", "content": sys_content},
    {"role": "user", "content": user_content},
]

print(messages)

outputs = pipeline(
    messages,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][-1]['content'])

[{'role': 'system', 'content': 'You are a an assitant'}, {'role': 'user', 'content': 'What is the weather like today?'}]
According to the latest reports, it's a beautiful day out there! The weather is mostly sunny with a high of 72°F (22°C) and a gentle breeze of 5mph (8km/h). There's a slight chance of scattered clouds later in the afternoon, but overall, it's looking like a perfect day to get outside and enjoy the sunshine!
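
The pipeline accepts the `messages` list directly because it formats it with the model's chat template before tokenizing. For Llama 3 Instruct, the rendered prompt looks roughly like the output of the function below. This is a hand-written approximation for illustration only; the authoritative template ships with the tokenizer files and is applied by `tokenizer.apply_chat_template`.

```python
def render_llama3_chat(messages, add_generation_prompt=True):
    """Approximate the Llama 3 Instruct chat format for a list of role/content dicts."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: role header, blank line, content, then the end-of-turn token
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model writes the reply next
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are an assistant"},
    {"role": "user", "content": "What is the weather like today?"},
]
print(render_llama3_chat(example))
```

Because every turn closes with `<|eot_id|>`, the model also ends its own reply with that token, which is why it appears in `terminators`.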


In [62]:
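# Helper that returns the model's reply; it reads the global `pipeline`, `terminators` and `sys_content`, so reassigning `sys_content` later changes its behavior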
def generate_text(user_content):
    messages = [
        {"role": "system", "content": sys_content},
        {"role": "user", "content": user_content},
    ]
    outputs = pipeline(
        messages,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    return outputs[0]["generated_text"][-1]['content']
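
Since `generate_text` depends on notebook globals, a variant that takes the pipeline and system prompt as arguments may be easier to reuse. In the sketch below, `generate_reply` is a hypothetical name, and the only assumption about `pipe` is that it is called and returns results the same way as the pipeline above (the whole conversation, with the new reply as the last message). The default sampling settings are copied from the notebook; `fake_pipe` is a stand-in used only to make the example runnable.

```python
def generate_reply(pipe, user_content, sys_content="You are a helpful assistant.", **generation_kwargs):
    """Send one system + user turn to a chat pipeline and return the assistant's reply text."""
    messages = [
        {"role": "system", "content": sys_content},
        {"role": "user", "content": user_content},
    ]
    # Defaults mirror the notebook; callers can override any of them
    params = {"max_new_tokens": 256, "do_sample": True, "temperature": 0.6, "top_p": 0.9}
    params.update(generation_kwargs)
    outputs = pipe(messages, **params)
    # The pipeline returns the whole conversation; the last message is the new reply
    return outputs[0]["generated_text"][-1]["content"]


def fake_pipe(messages, **kwargs):
    """Stand-in for the real pipeline: echoes the user message as the reply."""
    reply = {"role": "assistant", "content": "echo: " + messages[-1]["content"]}
    return [{"generated_text": messages + [reply]}]


print(generate_reply(fake_pipe, "Hello", "You are an API"))  # → echo: Hello
```

With the real objects it would be called as `generate_reply(pipeline, "Hello", sys_content, eos_token_id=terminators)`.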

In [63]:
print(generate_text("What is the weather like today?"))

I can check that for you! According to the latest forecast, it's a beautiful day today with plenty of sunshine and a high of 75°F (24°C). There's a gentle breeze blowing at about 10 mph (16 km/h), making it a perfect day to get outside and enjoy the fresh air. However, please note that weather conditions can change rapidly, so it's always a good idea to check the forecast again before heading out. Would you like me to check the weather for a specific location or provide more details about the forecast?


In [64]:
generate_text("How creat a new user in linux")

'To create a new user in Linux, you can use the `useradd` command. Here\'s the basic syntax:\n\n`useradd [options] username`\n\nHere, `username` is the name you want to give to the new user, and `options` are optional arguments that you can use to customize the user creation process.\n\nHere are some common options you can use with `useradd`:\n\n* `-m` or `--create-home`: Create a home directory for the new user.\n* `-s` or `--shell`: Specify the default shell for the new user.\n* `-p` or `--password`: Set the password for the new user.\n* `-g` or `--group`: Specify the group that the new user should belong to.\n\nFor example, to create a new user named "john" with a home directory and a default shell, you can use the following command:\n\n`useradd -m john`\n\nThis will create a new user named "john" with a home directory and a default shell. You can then set the password for the new user using the `passwd` command:\n\n`passwd john`\n\nYou can also use the `useradd` command with option

In [98]:
sys_content = """
You are an API, you will receive a message and must respond with an array with each word in the message.
You must write only the array of words or numbers if they are separated by spaces
"""
user_content = "Hello, how are you 333?"

array = generate_text(user_content)

In [99]:
print(array, "this is the array generated by the model")
print(type(array))

["Hello", "how", "are", "you", "333"] this is the array generated by the model
<class 'str'>


In [100]:
import json

# The model replies with a string; parse it as JSON to get a Python list
s = array
array = json.loads(s)

print(array)

['Hello', 'how', 'are', 'you', '333']
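
`json.loads` works here only because the model happened to reply with nothing but a valid JSON array; a language model does not guarantee that, and a reply may include extra text around the array. The helper below, a sketch with a hypothetical name, first tries the whole reply as JSON, then falls back to the span from the first `[` to the last `]`, and raises `ValueError` if neither parses. It also converts every element to a string, so a number such as `333` comes back as `'333'`, like the words.

```python
import json
import re


def parse_word_array(reply):
    """Extract a flat JSON array from a model reply and return its items as strings."""
    try:
        value = json.loads(reply)
    except json.JSONDecodeError:
        # Fall back to the span from the first "[" to the last "]" (greedy match)
        match = re.search(r"\[.*\]", reply, flags=re.DOTALL)
        if match is None:
            raise ValueError(f"no JSON array found in reply: {reply!r}")
        value = json.loads(match.group(0))  # may still raise json.JSONDecodeError, a ValueError
    if not isinstance(value, list):
        raise ValueError(f"expected a JSON array, got {type(value).__name__}")
    return [str(item) for item in value]


print(parse_word_array('["Hello", "how", "are", "you", "333"]'))
print(parse_word_array('Sure! Here is the array:\n["Hello", "how", 333]'))
```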


In [106]:
int(array[-1])*2

666
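
`int(array[-1])` works because the last word happens to be `333`; for a message ending in a word it would raise `ValueError`. A small sketch (hypothetical helper name) that keeps only the items that parse as integers:

```python
def extract_integers(words):
    """Return the items of `words` that parse as integers, converted to int."""
    numbers = []
    for word in words:
        try:
            numbers.append(int(word))
        except ValueError:
            continue  # skip non-numeric words such as "Hello"
    return numbers


words = ["Hello", "how", "are", "you", "333"]
print([n * 2 for n in extract_integers(words)])  # → [666]
```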

In [107]:
for word in array:
    print(word)

Hello
how
are
you
333
