<a href="https://colab.research.google.com/github/royam0820/HuggingFace/blob/main/amr_AutoTrain_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# restore the fine-tuned model saved by a previous training run, so inference can be run without retraining
!unzip /content/drive/MyDrive/llm_tuning/llama2-MJ-prompts.zip

In [2]:
#@title 🤗 AutoTrain LLM
#@markdown In order to use this colab
#@markdown - upload train.csv to the folder passed as `--data-path` (this notebook uses the current directory, `.`)
#@markdown - train.csv must contain a `text` column
#@markdown - choose a project name if you wish
#@markdown - change model if you wish, you can use most of the text-generation models from Hugging Face Hub
#@markdown - add huggingface information (token and repo_id) if you wish to push trained model to huggingface hub
#@markdown - update hyperparameters if you wish
#@markdown - click `Runtime > Run all` or run each cell individually

import os
!pip install -U autotrain-advanced > install_logs.txt
!autotrain setup > setup_logs.txt

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
lida 0.0.10 requires kaleido, which is not installed.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires openai, which is not installed.
tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 4.23.4 which is incompatible.
tensorflow-probability 0.22.0 requires typing-extensions<4.6.0, but you have typing-extensions 4.8.0 which is incompatible.
> INFO    Installing latest transformers@main
> INFO    Successfully installed latest transformers
> INFO    Installing latest peft@main
> INFO    Successfully installed latest peft
> INFO    Installing latest diffusers@main
> INFO    Successfully installed latest diffusers
> INFO    Installing latest trl@main
> INFO    Successfully installed la
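
The setup notes above only require that `train.csv` has a `text` column. The sketch below shows one way to build such a file with the standard library. The example pairs and the `###Human:` / `###Assistant:` layout are assumptions made for illustration (the notebook does not show its training data), and `to_text` and `write_train_csv` are hypothetical helper names.

```python
import csv
import io

# Hypothetical example pairs; replace them with your own data.
pairs = [
    ("generate a midjourney prompt for a child running in the rain",
     "a child in a yellow raincoat running through puddles, heavy rain, cinematic lighting"),
    ("generate a midjourney prompt for a lighthouse at dusk",
     "a lonely lighthouse on a cliff at dusk, dramatic clouds, warm light"),
]

def to_text(human: str, assistant: str) -> str:
    """Join one exchange into a single training string (assumed template)."""
    return f"###Human:\n{human}\n\n###Assistant:\n{assistant}"

def write_train_csv(rows, handle) -> None:
    """Write the rows to `handle` as a CSV with a single `text` column."""
    writer = csv.DictWriter(handle, fieldnames=["text"])
    writer.writeheader()
    for human, assistant in rows:
        writer.writerow({"text": to_text(human, assistant)})

# Build the CSV in memory so the result can be inspected first.
buffer = io.StringIO()
write_train_csv(pairs, buffer)
csv_content = buffer.getvalue()

# To create the real file:
# with open("train.csv", "w", newline="", encoding="utf-8") as f:
#     write_train_csv(pairs, f)
```

Whatever template is chosen here should be reused unchanged in the prompt at inference time.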

In [3]:
#@markdown ---
#@markdown #### Project Config
#@markdown Note: if you are using a restricted/private model, you need to enter your Hugging Face token in the next step.
project_name = 'llama2_MJ_Project-v2' # @param {type:"string"}
model_name = 'abhishek/llama-2-7b-hf-small-shards' # @param {type:"string"}

#@markdown ---
#@markdown #### Push to Hub?
#@markdown Use these only if you want to push your trained model to a private repo in your Hugging Face Account
#@markdown If you don't use these, the model will be saved in Google Colab and you will need to download it manually.
#@markdown Please enter your Hugging Face write token. The trained model will be saved to your Hugging Face account.
#@markdown You can find your token here: https://huggingface.co/settings/tokens
push_to_hub = True # @param ["False", "True"] {type:"raw"}
hf_token = "<YOUR_HF_WRITE_TOKEN>" #@param {type:"string"}
repo_id = "royam0820/llama2_MJ_Project-v2" #@param {type:"string"}

#@markdown ---
#@markdown #### Hyperparameters
learning_rate = 2e-4 # @param {type:"number"}
num_epochs = 3 #@param {type:"number"}
batch_size = 4 # @param {type:"slider", min:1, max:32, step:1}
block_size = 1024 # @param {type:"number"}
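# note: this value is not forwarded to the `autotrain llm` command below, so the run uses trainer='default'
trainer = "sft" # @param ["default", "sft"] {type:"raw"}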
warmup_ratio = 0.1 # @param {type:"number"}
weight_decay = 0.01 # @param {type:"number"}
gradient_accumulation = 4 # @param {type:"number"}
use_fp16 = True # @param ["False", "True"] {type:"raw"}
use_peft = True # @param ["False", "True"] {type:"raw"}
use_int4 = True # @param ["False", "True"] {type:"raw"}
lora_r = 16 #@param {type:"number"}
lora_alpha = 32 #@param {type:"number"}
lora_dropout = 0.05 #@param {type:"number"}

os.environ["PROJECT_NAME"] = project_name
os.environ["MODEL_NAME"] = model_name
os.environ["PUSH_TO_HUB"] = str(push_to_hub)
os.environ["HF_TOKEN"] = hf_token
os.environ["REPO_ID"] = repo_id
os.environ["LEARNING_RATE"] = str(learning_rate)
os.environ["NUM_EPOCHS"] = str(num_epochs)
os.environ["BATCH_SIZE"] = str(batch_size)
os.environ["BLOCK_SIZE"] = str(block_size)
os.environ["WARMUP_RATIO"] = str(warmup_ratio)
os.environ["WEIGHT_DECAY"] = str(weight_decay)
os.environ["GRADIENT_ACCUMULATION"] = str(gradient_accumulation)
os.environ["USE_FP16"] = str(use_fp16)
os.environ["USE_PEFT"] = str(use_peft)
os.environ["USE_INT4"] = str(use_int4)
os.environ["LORA_R"] = str(lora_r)
os.environ["LORA_ALPHA"] = str(lora_alpha)
os.environ["LORA_DROPOUT"] = str(lora_dropout)
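
A rough sketch of what these hyperparameters imply for the training schedule. The dataset size `num_examples` is a hypothetical value chosen for illustration, and the step counts are approximate: the trainer's own rounding may differ by a step or so. The LoRA scaling factor uses the standard `lora_alpha / lora_r` convention.

```python
import math

# Values taken from the configuration cell above.
batch_size = 4
gradient_accumulation = 4
num_epochs = 3
warmup_ratio = 0.1
lora_r = 16
lora_alpha = 32

# Hypothetical dataset size, for illustration only.
num_examples = 1000

# Examples seen per optimizer update (single GPU assumed).
effective_batch_size = batch_size * gradient_accumulation

# Approximate number of optimizer updates per epoch and overall.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

# Updates spent ramping the learning rate up before the linear decay starts.
warmup_steps = math.ceil(total_steps * warmup_ratio)

# Factor applied to the LoRA update before it is added to the frozen weights.
lora_scaling = lora_alpha / lora_r

print(effective_batch_size, steps_per_epoch, total_steps, warmup_steps, lora_scaling)
```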


In [4]:
!autotrain llm \
--train \
--model ${MODEL_NAME} \
--project-name ${PROJECT_NAME} \
--data-path . \
--text-column text \
--lr ${LEARNING_RATE} \
--batch-size ${BATCH_SIZE} \
--epochs ${NUM_EPOCHS} \
--block-size ${BLOCK_SIZE} \
--warmup-ratio ${WARMUP_RATIO} \
--lora-r ${LORA_R} \
--lora-alpha ${LORA_ALPHA} \
--lora-dropout ${LORA_DROPOUT} \
--weight-decay ${WEIGHT_DECAY} \
--gradient-accumulation ${GRADIENT_ACCUMULATION} \
$( [[ "$USE_FP16" == "True" ]] && echo "--fp16" ) \
$( [[ "$USE_PEFT" == "True" ]] && echo "--use-peft" ) \
$( [[ "$USE_INT4" == "True" ]] && echo "--use-int4" ) \
$( [[ "$PUSH_TO_HUB" == "True" ]] && echo "--push-to-hub --token ${HF_TOKEN} --repo-id ${REPO_ID}" )

> INFO    Running LLM
> INFO    Params: Namespace(version=False, train=True, deploy=False, inference=False, data_path='.', train_split='train', valid_split=None, text_column='text', rejected_text_column='rejected', model='abhishek/llama-2-7b-hf-small-shards', learning_rate=0.0002, num_train_epochs=3, train_batch_size=4, warmup_ratio=0.1, gradient_accumulation_steps=4, optimizer='adamw_torch', scheduler='linear', weight_decay=0.01, max_grad_norm=1.0, seed=42, add_eos_token=False, block_size=1024, use_peft=True, lora_r=16, lora_alpha=32, lora_dropout=0.05, logging_steps=-1, project_name='llama2_MJ_Project-v2', evaluation_strategy='epoch', save_total_limit=1, save_strategy='epoch', auto_find_batch_size=False, fp16=True, push_to_hub=True, use_int8=False, model_max_length=1024, repo_id='royam0820/llama2_MJ_Project-v2', use_int4=True, trainer='default', target_modules=None, merge_adapter=False, token='<redacted>', backend='default', username=None, use_f
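
The training command above adds the boolean switches only when the matching setting is `True`. The sketch below mirrors that logic in Python so the resulting command line can be inspected before it is run. `build_autotrain_command` and the `cfg` dictionary are hypothetical names; only flags that appear in the command above are used, and just a subset of the numeric options is shown to keep the example short.

```python
import shlex

def build_autotrain_command(cfg: dict) -> list:
    """Assemble the `autotrain llm --train` argument list from a settings dict."""
    args = [
        "autotrain", "llm", "--train",
        "--model", cfg["model_name"],
        "--project-name", cfg["project_name"],
        "--data-path", ".",
        "--text-column", "text",
        "--lr", str(cfg["learning_rate"]),
        "--batch-size", str(cfg["batch_size"]),
        "--epochs", str(cfg["num_epochs"]),
        "--block-size", str(cfg["block_size"]),
    ]
    # Boolean switches are appended only when enabled.
    for key, flag in (("use_fp16", "--fp16"),
                      ("use_peft", "--use-peft"),
                      ("use_int4", "--use-int4")):
        if cfg.get(key):
            args.append(flag)
    # The Hub options travel together, as in the shell version.
    if cfg.get("push_to_hub"):
        args += ["--push-to-hub", "--token", cfg["hf_token"], "--repo-id", cfg["repo_id"]]
    return args

cfg = {
    "model_name": "abhishek/llama-2-7b-hf-small-shards",
    "project_name": "llama2_MJ_Project-v2",
    "learning_rate": 2e-4,
    "batch_size": 4,
    "num_epochs": 3,
    "block_size": 1024,
    "use_fp16": True,
    "use_peft": True,
    "use_int4": True,
    "push_to_hub": False,  # set to True and fill in hf_token / repo_id to push
}

command = build_autotrain_command(cfg)
print(shlex.join(command))
```

Building the list first makes it easy to check which switches are active before handing the command to the shell.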

[My HF repository](https://huggingface.co/royam0820)

# Inference


In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.nn import DataParallel # for multiple GPUs

In [7]:
tokenizer = AutoTokenizer.from_pretrained("/content/llama2_MJ_Project-v2")
model = AutoModelForCausalLM.from_pretrained("/content/llama2_MJ_Project-v2")

In [17]:
input_context = '''
"###Human:
generate a midjourney prompt for a child running in the rain, give a detailed description
####Assistant:
'''

In [18]:
input_ids = tokenizer.encode(input_context, return_tensors='pt')

In [19]:
output = model.generate(input_ids, max_length=100, temperature=0.3, num_return_sequences=1, do_sample=True)
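
To see what `temperature=0.3` does when `do_sample=True`, here is a minimal, library-free sketch of temperature scaling. The logits are hypothetical scores for three candidate tokens; a real model produces one score per vocabulary entry, and the generation code applies further filters that are not shown here.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (1.0, 0.3):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability on the highest-scoring token, so sampled text becomes more predictable; values above 1 flatten the distribution.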

In [15]:
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)  # plural keyword; drops markers such as <s>
print(generated_text)

"###Human:
generate a midjourney prompt for a child running in the rain, give a detailed description
####Assistant:
The child is running in the rain, and the rain is falling heavily. The child is wearing a yellow raincoat and a pair of red rain boots. The child is running as fast as they can, trying to outrun the rain. The child is laughing and having fun, but they are also getting


In [None]:
# # amr test
# output = model.generate(
#          input_ids=input_ids,
#          #attention_mask=attention_mask,
#          pad_token_id= tokenizer.eos_token_id,
#          eos_token_id= tokenizer.eos_token_id,
#          max_new_tokens=500, do_sample=False,
#          top_k=30, top_p=0.85, temperature=0.3, repetition_penalty=1.2)

In [21]:
%%time
inputs = tokenizer(input_context, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, do_sample=True, max_length= 100, temperature=0.3, repetition_penalty=1.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



"###Human:
generate a midjourney prompt for a child running in the rain, give a detailed description
####Assistant:
The child is running in the rain, with a smile on her face. She is wearing a yellow raincoat and has a red umbrella in her hand. The rain is pouring down, and the child is laughing as she runs through the puddles. She looks like she is having the
CPU times: user 8min 13s, sys: 3.18 s, total: 8min 16s
Wall time: 2min 4s
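
Both outputs above echo the prompt before the generated answer. A small helper can keep only the assistant's part. This is a sketch, not part of the original notebook: `extract_response` is a hypothetical name, and it assumes the answer follows the last `Assistant:` marker, which covers both the `###Assistant:` and `####Assistant:` spellings.

```python
def extract_response(generated_text: str, marker: str = "Assistant:") -> str:
    """Return the text after the last occurrence of `marker`, stripped of whitespace."""
    _, found, tail = generated_text.rpartition(marker)
    # If the marker is missing, fall back to the whole text.
    return tail.strip() if found else generated_text.strip()

sample = '''
"###Human:
generate a midjourney prompt for a child running in the rain, give a detailed description
####Assistant:
The child is running in the rain, with a smile on her face.'''

print(extract_response(sample))
```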


# Web Interface

In [None]:
# # to fix the absence of utf-8 locale on Colab
# import locale
# def getpreferredencoding(do_setlocale = True):
#     return "UTF-8"
# locale.getpreferredencoding = getpreferredencoding

In [None]:
# log in to the HF Hub so the authentication token is available in this session
from huggingface_hub import login
login()

In [None]:
!pip install -qU gradio==3.41.0

In [None]:
import gradio as gr

In [None]:
from torch import cuda, bfloat16
from transformers import AutoTokenizer,AutoModelForCausalLM,AutoConfig,BitsAndBytesConfig,pipeline
from peft import AutoPeftModelForCausalLM

model_name = "royam0820/MJ_prompts"
model = AutoPeftModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=bfloat16),
    device_map='cuda:0',
)
# switch to evaluation mode (disables dropout) for inference
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_name)

pipeline = pipeline(
    task='text-generation',
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.1,  # creativity of responses: 0.0 none ->  1.0 max
    repetition_penalty=1.1  # to avoid repeating output
)


The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonForCausalLM', 'PLBartFo

In [None]:
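def predict(prompt):
  """Generate a completion for `prompt` and return it as a single string."""
  # call-time arguments override the defaults set when the pipeline was built
  sequences = pipeline(
    prompt,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=100,  # has no effect here: the pipeline's max_new_tokens=512 takes precedence
  )

  # each returned item is a dict holding the prompt plus its completion
  response = ''
  for seq in sequences:
    response += seq['generated_text']

  return response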

NB: The inference code is the same as before. The only change is that the pipeline call, configured as usual, is wrapped in a function, so that Gradio can invoke it on demand with the user's request passed in through the `prompt` parameter.
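
The wrapping pattern can be tried without loading the model. In the sketch below, `fake_pipeline` is a hypothetical stand-in that returns data shaped like the text-generation pipeline's output (a list of dicts with a `generated_text` key), and `make_predict` is an illustrative helper that is not part of the notebook.

```python
def fake_pipeline(prompt, **generation_kwargs):
    """Hypothetical stand-in for the real pipeline: one dict per returned sequence."""
    count = generation_kwargs.get("num_return_sequences", 1)
    return [{"generated_text": f"{prompt} [completion {i + 1}]"} for i in range(count)]

def make_predict(generator):
    """Wrap a pipeline-like callable in a one-argument function, as Gradio expects."""
    def predict(prompt):
        sequences = generator(
            prompt,
            do_sample=True,
            temperature=0.2,
            top_p=0.9,
            num_return_sequences=1,
        )
        # Concatenate the text of every returned sequence.
        return "".join(seq["generated_text"] for seq in sequences)
    return predict

predict = make_predict(fake_pipeline)
print(predict("a child running in the rain"))
```

Passing the generator in as an argument keeps the function easy to test: the real pipeline can be swapped in once the model is loaded.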

In [None]:
demo = gr.Interface(
  # call to the function predict
  fn=predict,
  # text boxes for input and output
  inputs=gr.Textbox(label="Please, write your request here:", placeholder="example: def fibonacci(", lines=5),
  outputs=gr.Textbox(label="Answer (inference):"),
  # web interface title
  title='AutoTrain LLM Tuning',
  description='The answers given are from a fine-tuned Llama2 7B for MJ prompts ',
  article='Nicola Santi https://medium.com',
  # list of ready made examples
  examples=[["def Fibonacci("], ["function DotProduct("], ['springboot profile'], ['write a class for manage shipment']],
  allow_flagging="never"
)

demo.launch(share=True, debug=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://fb9bb23a4c80f3b606.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Both `max_new_tokens` (=512) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
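
The warning above appears because the two limits count different things: `max_length` covers the prompt plus the generated tokens, while `max_new_tokens` covers only the generated tokens and wins when both are set. The sketch below illustrates that rule. `generation_budget` is a hypothetical helper, not a transformers function, and the prompt length of 40 tokens is an assumed value.

```python
def generation_budget(prompt_tokens, max_length=None, max_new_tokens=None):
    """Return how many new tokens may be generated for a prompt of the given length."""
    if max_new_tokens is not None:
        # Counts generated tokens only and takes precedence, as the warning states.
        return max_new_tokens
    if max_length is not None:
        # Counts prompt and generated tokens together.
        return max(max_length - prompt_tokens, 0)
    raise ValueError("set max_length or max_new_tokens")

prompt_tokens = 40  # assumed prompt length, for illustration

both_set = generation_budget(prompt_tokens, max_length=100, max_new_tokens=512)
length_only = generation_budget(prompt_tokens, max_length=100)
print(both_set, length_only)
```

Setting only one of the two limits avoids the warning and makes the intended budget explicit.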


NB: Gradio configuration

The three main settings are:
- `fn`: the function that connects the interface to the underlying model, here `predict`;
- `inputs`: the text area where the user types the prompt, customized with a label, a placeholder hint and a height in lines;
- `outputs`: the text area that shows the model's answer, for which only the label is set.

The remaining settings specify a title, an HTML description, and a list of ready-made examples.

`allow_flagging`: When a user testing your model sees input with interesting output, they can click the flag button to send the input and output data back to the machine where the demo is running. The sample is saved to a CSV log file (by default). If the demo involves images, audio, video, or other types of files, these are saved separately in a parallel directory and the paths to these files are saved in the CSV file. With the value `never` the users will not see a button to flag, and no sample will be flagged.
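
To make the flagging idea concrete, here is a conceptual sketch of appending a flagged sample to a CSV log. It is not Gradio's implementation: `flag_sample` and the column layout are assumptions chosen for illustration.

```python
import csv
import io
from datetime import datetime, timezone

def flag_sample(handle, prompt, answer):
    """Append one flagged input/output pair, with a UTC timestamp, to a CSV log."""
    writer = csv.writer(handle)
    writer.writerow([prompt, answer, datetime.now(timezone.utc).isoformat()])

# An in-memory buffer stands in for the log file in this example.
log = io.StringIO()
flag_sample(log, "a child running in the rain", "The child is wearing a yellow raincoat.")
flag_sample(log, "a lighthouse at dusk", "A lonely lighthouse on a cliff.")

logged_rows = list(csv.reader(io.StringIO(log.getvalue())))
print(len(logged_rows), "samples flagged")
```

With `allow_flagging="never"`, as in the interface above, no such log is written because the flag button is not shown.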

Finally, the last command, `demo.launch(...)`, starts the application and prints the URL to open in a browser to test it. This concludes the presentation.