# __Please run the provided demo notebook file in Google Colab to explore the hands-on example.__

# **Description**
In this project, we will fine-tune our own instance of a large language model (LLM), using Falcon-7B as the base model.

# **Prerequisites**
* Python programming skills
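* Installation of the Hugging Face Python libraries used below (`datasets`, `transformers`, `peft`, `trl`, `bitsandbytes`, `accelerate`)
* A Hugging Face account (for login) and a GPU machine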
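
As a quick sanity check before running the notebook, the sketch below reports which packages are not yet importable. The package list is an assumption derived from the import statements in the cells that follow, not an official requirements file, and `missing_packages` is a hypothetical helper name.

```python
import importlib.util

# Assumption: these are the top-level packages imported by the cells below.
REQUIRED = ["torch", "datasets", "transformers", "peft", "trl", "bitsandbytes", "accelerate"]

def missing_packages(names):
    """Return the names that cannot be located on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```

The check only looks the packages up; it does not import them, so it stays fast even on a machine without a GPU.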

# **Steps to perform:**
1. Set up your environment for the project
2. Load the dataset using Hugging Face
3. Create a large language model (LLM) for your project
4. Develop an LLM tokenizer for processing text
5. Save your trained model and generate inferences as needed


# **Step 1: Set up the environment**


*   Import the necessary libraries and set up Hugging Face token access

In [None]:
# Log in to Hugging Face
from huggingface_hub import notebook_login
notebook_login()


# **Step 2: Load the dataset from Hugging Face**

In [None]:
from datasets import load_dataset
billsum = load_dataset("billsum", split="train")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
billsum

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 18949
})

In [None]:
billsum[0]

{'text': "SECTION 1. LIABILITY OF BUSINESS ENTITIES PROVIDING USE OF FACILITIES \n              TO NONPROFIT ORGANIZATIONS.\n\n    (a) Definitions.--In this section:\n            (1) Business entity.--The term ``business entity'' means a \n        firm, corporation, association, partnership, consortium, joint \n        venture, or other form of enterprise.\n            (2) Facility.--The term ``facility'' means any real \n        property, including any building, improvement, or appurtenance.\n            (3) Gross negligence.--The term ``gross negligence'' means \n        voluntary and conscious conduct by a person with knowledge (at \n        the time of the conduct) that the conduct is likely to be \n        harmful to the health or well-being of another person.\n            (4) Intentional misconduct.--The term ``intentional \n        misconduct'' means conduct by a person with knowledge (at the \n        time of the conduct) that the conduct is harmful to the health \n        or w

In [None]:
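def format_input(example: dict) -> dict:
    # Build one prompt/response string per record; the trainer later reads it
    # from the 'formated_text' column.
    example['formated_text'] = (
        f'### Human: summarize given text: {example["text"]}\n'
        f'### Assistant : {example["summary"]}'
    )
    return example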

In [None]:
billsum=billsum.map(format_input)

In [None]:
billsum

Dataset({
    features: ['text', 'summary', 'title', 'formated_text'],
    num_rows: 18949
})
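
To make the prompt layout concrete, here is a minimal sketch that applies the same template to a toy record. The record contents and the helper name `build_prompt` are hypothetical; only the `### Human` / `### Assistant` layout and the `formated_text` column name come from the notebook.

```python
def build_prompt(text: str, summary: str = "") -> str:
    """Format one record in the '### Human / ### Assistant' layout used above."""
    return f"### Human: summarize given text: {text}\n### Assistant : {summary}"

# Hypothetical record, standing in for one row of the billsum dataset.
record = {"text": "A bill to rename a post office.", "summary": "Renames a post office."}
record["formated_text"] = build_prompt(record["text"], record["summary"])
print(record["formated_text"])

# Leaving the summary empty yields a prompt the model is expected to complete.
inference_prompt = build_prompt(record["text"])
```

Building training strings and inference prompts from one helper is one way to keep the two templates from drifting apart in wording or punctuation.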

In [None]:
billsum_sampled=billsum.shuffle(seed=12).select(range(100))

In [None]:
%pip install -q accelerate bitsandbytes peft trl



# **Step 3: Create the LLM**


*   Initialize the model by specifying the model name and loading the pre-trained checkpoint `ybelkada/falcon-7b-sharded-bf16`

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

model_name = "ybelkada/falcon-7b-sharded-bf16"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
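model.config.use_cache = False  # the generation cache is not needed while training
model.gradient_checkpointing_enable()  # trade extra compute for lower activation memory
model = prepare_model_for_kbit_training(model)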



In [None]:
model

FalconForCausalLM(
  (transformer): FalconModel(
    (word_embeddings): Embedding(65024, 4544)
    (h): ModuleList(
      (0-31): 32 x FalconDecoderLayer(
        (self_attention): FalconAttention(
          (maybe_rotary): FalconRotaryEmbedding()
          (query_key_value): Linear4bit(in_features=4544, out_features=4672, bias=False)
          (dense): Linear4bit(in_features=4544, out_features=4544, bias=False)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): FalconMLP(
          (dense_h_to_4h): Linear4bit(in_features=4544, out_features=18176, bias=False)
          (act): GELU(approximate='none')
          (dense_4h_to_h): Linear4bit(in_features=18176, out_features=4544, bias=False)
        )
        (input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_f): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4544, out_features=65024, bias=False)
)
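
The layer shapes printed above are enough to estimate the parameter count and a rough weight-memory footprint. The sketch below is back-of-the-envelope arithmetic under stated assumptions: `lm_head` is assumed to share its weights with `word_embeddings` (so it is counted once), unquantized parameters are assumed to take 2 bytes each, and quantization constants, activations, and optimizer state are ignored.

```python
# Shapes copied from the printed FalconForCausalLM module.
VOCAB, HIDDEN, N_LAYERS = 65024, 4544, 32
QKV_OUT, MLP_HIDDEN = 4672, 18176

linear_per_layer = (
    HIDDEN * QKV_OUT        # query_key_value
    + HIDDEN * HIDDEN       # dense
    + HIDDEN * MLP_HIDDEN   # dense_h_to_4h
    + MLP_HIDDEN * HIDDEN   # dense_4h_to_h
)
layernorm_params = 2 * HIDDEN  # weight + bias of one LayerNorm

quantized = N_LAYERS * linear_per_layer  # stored as Linear4bit
unquantized = (
    VOCAB * HIDDEN                   # word_embeddings (assumed tied with lm_head)
    + N_LAYERS * layernorm_params    # input_layernorm in each decoder layer
    + layernorm_params               # ln_f
)
total = quantized + unquantized

GIB = 1024 ** 3
mem_16bit_gib = total * 2 / GIB                             # everything at 2 bytes
mem_4bit_gib = (quantized * 0.5 + unquantized * 2) / GIB    # linears at 4 bits

print(f"total parameters: {total:,}")
print(f"16-bit weights:   {mem_16bit_gib:.2f} GiB")
print(f"4-bit weights:    {mem_4bit_gib:.2f} GiB")
```

The gap between the two estimates is the main reason the 4-bit configuration fits on a single Colab GPU; actual usage during training will be higher than either figure.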

# **Step 4: Create the LLM tokenizer**


*   Load the tokenizer, define the LoRA and training configurations, then initialize an **SFTTrainer** with the model and dataset and start the training process

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
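
Reusing the end-of-sequence token for padding lets sequences of different lengths be batched together. The sketch below illustrates the idea in plain Python; `pad_batch` is a hypothetical helper, not the tokenizer's own implementation, and the pad id 11 is taken from the `GenerationConfig` output shown later in this notebook.

```python
PAD_ID = 11  # pad/eos token id reported in this notebook's GenerationConfig output

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad lists of token ids to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[19468, 6823, 37], [19468]])
print(ids)
print(mask)
```

Because padding and end-of-sequence share one id, the attention mask is what tells the model which positions are padding.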


In [None]:
from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    r=16,  # rank of the LoRA update matrices
    lora_alpha=32,  # alpha scaling
    # target_modules=["q_proj", "v_proj"], #if you know the
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,  # set this for CLM or Seq2Seq
    target_modules=["query_key_value"],  # fused attention projection shown in the printed model
)

In [None]:
from peft import get_peft_model
model = get_peft_model(model, peft_config)
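
To see how small the trainable part is, the LoRA parameter count can be worked out by hand. This is a sketch assuming the standard LoRA layout, where each targeted linear layer gains two matrices of shapes `(r, in_features)` and `(out_features, r)`; the layer shapes come from the printed model, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

R, LORA_ALPHA, N_LAYERS = 16, 32, 32

# query_key_value is Linear(in_features=4544, out_features=4672) in every decoder layer.
per_layer = lora_param_count(4544, 4672, R)
trainable = N_LAYERS * per_layer
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {trainable:,}")
print(f"update scaling:            {scaling}")
```

Calling `model.print_trainable_parameters()` on the PEFT-wrapped model should report a comparable trainable count.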

In [None]:
from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir='./training_result',
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    optim='paged_adamw_8bit',
    learning_rate=2e-4,
    fp16=True,
    warmup_ratio=0.05,
    group_by_length=True,
    lr_scheduler_type="cosine"
)
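
The arguments above determine both the number of optimizer steps and the learning rate at each step. The sketch below reproduces that arithmetic for the 100-example sample, assuming a single GPU and a linear warmup followed by cosine decay to zero; it is an illustration of the schedule shape, not the library's own code, and `cosine_lr` is a hypothetical helper.

```python
import math

def cosine_lr(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

num_samples = 100      # size of billsum_sampled
per_device_batch = 1   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
epochs = 1             # num_train_epochs

effective_batch = per_device_batch * grad_accum
total_steps = math.ceil(num_samples / effective_batch) * epochs
warmup_steps = math.ceil(total_steps * 0.05)  # warmup_ratio=0.05

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {total_steps}")
print(f"warmup steps:         {warmup_steps}")
for step in (0, 1, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps, warmup_steps, 2e-4):.2e}")
```

The step count computed here agrees with the `global_step=25` that `trainer.train()` reports in its output.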

In [None]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=billsum_sampled,
    peft_config=peft_config,
    dataset_text_field="formated_text",
    max_seq_length=1024,
    tokenizer=tokenizer,
    args=training_arguments,
)


In [None]:
trainer.train()

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
The current implementation of Falcon calls `torch.scaled_dot_product_attention` directly, this will be deprecated in the future in favor of the `BetterTransformer` API. Please install the latest optimum library with `pip install -U optimum` and call `model.to_bettertransformer()` to benefit from `torch.scaled_dot_product_attention` and future performance optimizations.




TrainOutput(global_step=25, training_loss=1.2728195190429688, metrics={'train_runtime': 401.9924, 'train_samples_per_second': 0.249, 'train_steps_per_second': 0.062, 'total_flos': 4074068115456000.0, 'train_loss': 1.2728195190429688, 'epoch': 1.0})
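
The metrics in `TrainOutput` can be cross-checked with simple arithmetic, which also shows why the notebook trains on a 100-example sample. The projection below assumes throughput would stay constant on the full training split, which is only a rough approximation.

```python
train_runtime_s = 401.9924   # 'train_runtime' from TrainOutput
samples_trained = 100        # size of billsum_sampled
global_steps = 25            # 'global_step' from TrainOutput
full_dataset_rows = 18949    # num_rows of the full train split

samples_per_second = samples_trained / train_runtime_s
steps_per_second = global_steps / train_runtime_s

# Assumption: the same throughput holds for the whole dataset.
projected_hours = full_dataset_rows / samples_per_second / 3600

print(f"samples/s: {samples_per_second:.3f}")
print(f"steps/s:   {steps_per_second:.3f}")
print(f"projected time for one epoch on the full split: {projected_hours:.1f} h")
```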

# **Step 5: Save the model**



In [None]:
trained_model_dir='./trained_model'
model.save_pretrained(trained_model_dir)

In [None]:
from peft import PeftConfig, PeftModel

config = PeftConfig.from_pretrained(trained_model_dir)
trained_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

trained_model = PeftModel.from_pretrained(trained_model, trained_model_dir)

trained_model_tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
trained_model_tokenizer.pad_token = trained_model_tokenizer.eos_token


# **Step 5.1: Create generation config for prediction**

In [None]:
generation_config = trained_model.generation_config
generation_config.max_new_tokens = 1024  # the attribute name is plural
generation_config.do_sample = True  # temperature and top_p only take effect when sampling
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1  # the attribute name is plural
generation_config.pad_token_id = trained_model_tokenizer.pad_token_id
generation_config.eos_token_id = trained_model_tokenizer.eos_token_id

In [None]:
generation_config

GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 11,
  "max_new_token": 1024,
  "num_return_sequence": 1,
  "pad_token_id": 11,
  "temperature": 0.7,
  "top_p": 0.7
}
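
To illustrate what `temperature` and `top_p` control, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering on a toy distribution. The logits are hypothetical, and the library's own implementation may differ in details such as how ties and the cutoff boundary are handled.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.5, 0.5, -1.0]  # hypothetical scores for four candidate tokens
probs = softmax(logits, temperature=0.7)
nucleus = top_p_filter(probs, top_p=0.7)
print("probabilities:", [round(p, 3) for p in probs])
print("tokens kept:  ", sorted(nucleus))
```

A temperature below 1 sharpens the distribution, and a lower `top_p` shrinks the set of tokens that can be sampled, so both settings push generation toward the most likely continuations.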

# **Step 6: Run inference with the model**

In [None]:
query ='''A new large language model (llm) called Falcon-7b was developed using stablize a text and code training dataset. one of the biggest llm ever made, it has 7 billion parameters, falcon 7b is capable of doing wide range of jobs, such as creating text, translating language'''

In [None]:
prompt = f'''### Human: Summarize the given text : {query}
### Assistant: '''

In [None]:
encodings = trained_model_tokenizer(prompt, return_tensors="pt").to("cuda:0")


In [None]:
encodings

{'input_ids': tensor([[19468,  6823,    37, 12753,   270,   907,   248,  2132,  2288,   204,
            37,   317,   627,  1902,  3599,  2308,   204,    19,   567,    88,
            20,  1964, 38257,    24,    34,    77,   398,  4027,  1241,   324,
         17474,   907,   241,  2288,   273,  2928,  2555, 20512,    25,   532,
           275,   248,  5270, 17826,    88,  1779,  1021,    23,   334,   504,
           204,    34,  4777,  9038,    23, 20519,  1043,   204,    34,    77,
           304,  7487,   275,  1836,  3436,  2393,   275,  4757,    23,   963,
           345,  3957,  2288,    23, 42721,  3599,   193, 19468, 12453,    37,
           204]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'), '

In [None]:
outputs = trained_model.generate(
    input_ids=encodings.input_ids,
    attention_mask=encodings.attention_mask,
    generation_config=generation_config,
    max_new_tokens=100,  # overrides the value stored in generation_config
)




In [None]:
outputs

tensor([[19468,  6823,    37, 12753,   270,   907,   248,  2132,  2288,   204,
            37,   317,   627,  1902,  3599,  2308,   204,    19,   567,    88,
            20,  1964, 38257,    24,    34,    77,   398,  4027,  1241,   324,
         17474,   907,   241,  2288,   273,  2928,  2555, 20512,    25,   532,
           275,   248,  5270, 17826,    88,  1779,  1021,    23,   334,   504,
           204,    34,  4777,  9038,    23, 20519,  1043,   204,    34,    77,
           304,  7487,   275,  1836,  3436,  2393,   275,  4757,    23,   963,
           345,  3957,  2288,    23, 42721,  3599,   193, 19468, 12453,    37,
           204,    13]], device='cuda:0')

In [None]:
outputs = trained_model_tokenizer.decode(outputs[0],skip_special_tokens=True)
outputs

'### Human: Summarize the given text : A new large language model (llm) called Falcon-7b was developed using stablize a text and code training dataset. one of the biggest llm ever made, it has 7 billion parameters, falcon 7b is capable of doing wide range of jobs, such as creating text, translating language\n### Assistant: "'
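
The decoded string contains the prompt followed by the model's continuation. A minimal sketch for keeping only the continuation is shown below; `extract_response` is a hypothetical helper that assumes the `### Assistant:` marker appears exactly as written in the inference prompt.

```python
def extract_response(decoded: str, marker: str = "### Assistant:") -> str:
    """Return the text after the first occurrence of the marker, or the whole text if it is absent."""
    _, found, tail = decoded.partition(marker)
    return tail.strip() if found else decoded.strip()

# Hypothetical decoded output, shortened for illustration.
example = "### Human: Summarize the given text : some text\n### Assistant: A short summary."
print(extract_response(example))
```

An alternative that avoids string matching is to decode only the newly generated token ids, for example by slicing the generated tensor from the prompt length onward before decoding.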

# **Conclusion**
By the end of this project, you will have learned how to use Hugging Face libraries and Falcon-7B to fine-tune a model and generate text.
The code initializes a pre-trained language model, prepares it for k-bit training, defines the training configuration, trains the model, saves and reloads the trained model, and uses it to generate text from a given prompt.

This series of steps showcases the workflow from model preparation and training to utilizing the trained model for a specific task, in this case, generating text based on a prompt.