# **LLM**
A Large Language Model (LLM) is a deep-learning language model with a very large number of parameters. It is first pre-trained on large and diverse data and can then be fine-tuned for a specific task.

# **Fine-Tuning**
Fine-tuning is the process of taking a pre-trained machine learning model and retraining (tuning) it partially or completely on a smaller dataset and a more specific task. The aim is to improve the model's performance in more specific tasks by leveraging the knowledge gained during previous training. With fine-tuning, models can be adapted to better handle specific tasks without having to train from scratch, which often requires greater time and resources.
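
As a toy illustration of why starting from pre-trained weights helps, the sketch below fits a one-parameter linear model `y = w * x` with plain gradient descent. It is not an LLM: the data, learning rate, and step counts are made-up assumptions, chosen only to show that a few fine-tuning steps from a pre-trained weight can land closer to the target than the same few steps from scratch.

```python
def train(w, data, lr, steps):
    """Run `steps` of full-batch gradient descent on mean squared error for y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# "Pre-training": a larger generic dataset where y = 2.0 * x (hypothetical data).
pretrain_data = [(k / 10, 2.0 * k / 10) for k in range(1, 101)]
# "Fine-tuning": a small task-specific dataset where y = 2.2 * x (hypothetical data).
finetune_data = [(1.0, 2.2), (2.0, 4.4), (3.0, 6.6)]

pretrained_w = train(0.0, pretrain_data, lr=0.01, steps=200)
# Same small budget of 5 steps, once from the pre-trained weight and once from zero.
finetuned_w = train(pretrained_w, finetune_data, lr=0.01, steps=5)
scratch_w = train(0.0, finetune_data, lr=0.01, steps=5)

print(f"pre-trained w: {pretrained_w:.3f}")
print(f"fine-tuned w:  {finetuned_w:.3f}")
print(f"from scratch:  {scratch_w:.3f}")
```

With the same five update steps, the fine-tuned weight ends up much nearer to the task's target of 2.2 than the weight trained from scratch, which is the effect fine-tuning relies on at a far larger scale.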

# **0.0 Using Hugging Face Dataset**
Fine-tuning a Large Language Model (LLM) here means refining a pre-trained model on a dataset hosted on the Hugging Face Hub. The dataset serves as additional training data, allowing the model to adapt and specialize its language understanding to the domain or task the dataset represents.
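
Some datasets in this notebook are selected as `name:config` (for example `wikitext:wikitext-2-raw-v1`), while others are given by name only. A minimal sketch of how such a string could be turned into the `--dataset_name` and `--dataset_config_name` arguments used by the fine-tuning cell is shown here; `dataset_args` is a hypothetical helper, not part of Transformers.

```python
def dataset_args(dataset_name):
    """Turn 'name' or 'name:config' into a list of command-line arguments."""
    # partition() returns an empty config when there is no ":" in the string.
    name, _, config = dataset_name.partition(":")
    args = ["--dataset_name", name]
    if config:
        args += ["--dataset_config_name", config]
    return args

print(" ".join(dataset_args("wikitext:wikitext-2-raw-v1")))
print(" ".join(dataset_args("yahma/alpaca-cleaned")))
```

Returning a list rather than a single string keeps each argument separate, which is convenient if the command is later launched with `subprocess` instead of a `!` shell line.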

In [None]:
#@title 0.1 Mount Drive

import os
from google.colab import drive

#os.mkdir("Drive")
drive.mount("Drive")

print("[INFO] Success mount drive at /content/Drive")

In [None]:
#@title 0.2 Install Transformers from GitHub
!git clone https://github.com/huggingface/transformers
%cd transformers
!pip install .
%cd /content/transformers/examples/pytorch/language-modeling

print("[INFO] Success build Transformers")

In [None]:
#@title 0.3 Fine-Tuning Model
#@markdown Choose a pretrained language model, or enter the path to your last checkpoint folder if you want to resume training from a checkpoint. A model with fewer parameters is recommended if you are using a free GPU
MODEL = "openai-gpt" #@param ["openai-gpt", "distilgpt2", "gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl", "facebook/opt-125m", "facebook/opt-350m", "facebook/opt-2.7b", "facebook/xglm-564M", "facebook/xglm-7.5B", "EleutherAI/gpt-neo-125m", "EleutherAI/gpt-neo-1.3B", "EleutherAI/gpt-j-6b", "EleutherAI/pythia-14m", "EleutherAI/pythia-31m", "EleutherAI/pythia-70m", "EleutherAI/pythia-160m", "EleutherAI/pythia-410m" , "EleutherAI/pythia-1b", "roneneldan/TinyStories-1M", "roneneldan/TinyStories-3M", "roneneldan/TinyStories-8M", "roneneldan/TinyStories-33M"] {allow-input: true}
#@markdown Enter the model name for the output (no spaces; use "-" instead)
OUTPUT_NAME = "MikaGPT" #@param {type: "string"}
#@markdown Select the dataset you want to use
DATASET_NAME = "yahma/alpaca-cleaned" #@param ["wikitext:wikitext-103-raw-v1", "wikitext:wikitext-103-v1", "wikitext:wikitext-2-raw-v1", "wikitext:wikitext-2-v1", "yahma/alpaca-cleaned"]
#@markdown Enter the number of training steps between saved checkpoints. Enter -1 if you want to save only the last checkpoint
SAVE_STEPS = 100 #@param {type: "integer"}
#@markdown Enter the batch size per device. The batch size is the number of samples given to the model in one iteration. A larger batch size reduces gradient variance but uses more GPU memory; a smaller one uses less memory but makes the gradients noisier, which can make convergence less stable
BATCH_SIZE = 11 #@param {type:"slider", min:1, max:64, step:1}
#@markdown Enter the number of epochs for the fine-tuning process. One epoch is one full pass over the training dataset. The more epochs, the longer the process takes
TRAIN_EPOCHS = 3 #@param {type:"slider", min:1, max:100, step:1}
#@markdown Select if you want to continue model training from the checkpoint
RESUME_FROM_CHECKPOINT = False #@param {type:"boolean"}
#@markdown Enter additional parameters, leave blank if there are no additional parameters. For documentation please run cell 0.5 or 1.6
ADDITIONAL_PARAMETERS = "" #@param {type: "string"}

if RESUME_FROM_CHECKPOINT:
  ADDITIONAL_PARAMETERS += f" --resume_from_checkpoint {MODEL}"

# Wikitext entries use the form "name:config"; other datasets are passed by name only
if ":" in DATASET_NAME:
  DATASET, DATASET_CONFIG = DATASET_NAME.split(":", 1)
  ADDITIONAL_PARAMETERS += f" --dataset_name {DATASET} --dataset_config_name {DATASET_CONFIG}"
else:
  ADDITIONAL_PARAMETERS += f" --dataset_name {DATASET_NAME}"

!pip install -r requirements.txt
!python run_clm.py --model_name_or_path {MODEL} --per_device_train_batch_size {BATCH_SIZE} --per_device_eval_batch_size {BATCH_SIZE} --do_train --do_eval --output_dir /content/{OUTPUT_NAME} --overwrite_output_dir --save_steps={SAVE_STEPS} --num_train_epochs {TRAIN_EPOCHS} --logging_steps 100 --save_total_limit 2 {ADDITIONAL_PARAMETERS}

print("[INFO] Success Fine Tuning Model")

In [None]:
#@title 0.4 Save data to Drive
#@markdown Enter the checkpoint folder name (example: checkpoint-10000). Enter "-" if you only want to save the last checkpoint
STEP = "-" #@param {type: "string"}

import os

%cd /content
!mkdir -p /content/Drive/MyDrive/{OUTPUT_NAME}

# Copy from the chosen checkpoint folder, or from the output root for the final model
SOURCE = f"{OUTPUT_NAME}/{STEP}" if STEP != "-" else OUTPUT_NAME
# Some files may be absent depending on the tokenizer and the Transformers version
# (recent versions save model.safetensors instead of pytorch_model.bin), so skip missing ones
for name in ["config.json", "generation_config.json", "merges.txt", "pytorch_model.bin", "model.safetensors", "tokenizer.json", "vocab.json"]:
  if os.path.exists(f"{SOURCE}/{name}"):
    !cp {SOURCE}/{name} /content/Drive/MyDrive/{OUTPUT_NAME}

print("[INFO] Success saving data to Google Drive")

In [None]:
#@title 0.5 Show help for additional parameters
!pip install -r requirements.txt
!python run_clm.py --help

# **1.0 Using Custom Dataset**
Fine-tuning a Large Language Model (LLM) with a custom text dataset involves adapting a pre-trained model by training it on your own dataset. This process aims to make the model more specific to a particular task or domain.
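
With a custom dataset, the amount of data determines how long a run takes and how many checkpoints it writes. The sketch below estimates this, assuming a single device, no gradient accumulation, and a final partial batch that is kept. `estimate_training` is a hypothetical helper, and `num_samples` means the number of training blocks after tokenization and grouping, which is generally not the number of raw text lines; the value 8500 is made up for illustration. It follows this notebook's convention that a non-positive `save_steps` means no periodic checkpoints.

```python
import math

def estimate_training(num_samples, batch_size, epochs, save_steps):
    """Estimate optimizer steps and periodic checkpoints for a single-device run."""
    # One optimizer step consumes one batch; the last batch may be smaller.
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    # A checkpoint is written every `save_steps` optimizer steps.
    checkpoints = total_steps // save_steps if save_steps > 0 else 0
    return steps_per_epoch, total_steps, checkpoints

steps_per_epoch, total_steps, checkpoints = estimate_training(
    num_samples=8500, batch_size=11, epochs=3, save_steps=100
)
print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"checkpoints:     {checkpoints}")
```

The checkpoint count is the number written over the whole run; because the training command passes `--save_total_limit 2`, only the two most recent ones should remain on disk at any time.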

In [None]:
#@title 1.1 Mount Drive

import os
from google.colab import drive

#os.mkdir("Drive")
drive.mount("Drive")

print("[INFO] Success mount drive at /content/Drive")

In [None]:
#@title 1.2 Install Transformers from GitHub
!git clone https://github.com/huggingface/transformers
%cd transformers
!pip install .
%cd /content/transformers/examples/pytorch/language-modeling

print("[INFO] Success build Transformers")

In [None]:
#@title 1.3 Splitting the text dataset (TXT) into two files, training and validation
#@markdown Enter the path to the text dataset file that will be used
DATASET_RAW = "/content/Drive/MyDrive/LN-ID-10K/LN-ID-10K-10000-ch.txt" #@param {type: "string"}
#@markdown Enter the fraction of lines to use for training (the rest is used for validation)
TRAIN_SIZE = 0.85 #@param {type:"slider", min:0.5, max:0.95, step:0.01}

from sklearn.model_selection import train_test_split

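# Read one sample per line; make sure every line ends with a newline so that
# shuffled lines cannot be glued together when they are written back out
with open(DATASET_RAW, 'r', encoding='utf-8') as file:
    dataset = [line if line.endswith('\n') else line + '\n' for line in file]

# Shuffle and split the lines; the fixed random_state makes the split reproducible
train_data, valid_data = train_test_split(dataset, train_size=TRAIN_SIZE, random_state=42)

with open('train.txt', 'w', encoding='utf-8') as file:
    file.writelines(train_data)

with open('valid.txt', 'w', encoding='utf-8') as file:
    file.writelines(valid_data)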

print("[INFO] Success splitting data")

In [None]:
#@title 1.4 Fine-Tuning Model
#@markdown Choose a pretrained language model, or enter the path to your last checkpoint folder if you want to resume training from a checkpoint. A model with fewer parameters is recommended if you are using a free GPU
MODEL = "openai-gpt" #@param ["openai-gpt", "distilgpt2", "gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl", "facebook/opt-125m", "facebook/opt-350m", "facebook/opt-2.7b", "facebook/xglm-564M", "facebook/xglm-7.5B", "EleutherAI/gpt-neo-125m", "EleutherAI/gpt-neo-1.3B", "EleutherAI/gpt-j-6b", "EleutherAI/pythia-14m", "EleutherAI/pythia-31m", "EleutherAI/pythia-70m", "EleutherAI/pythia-160m", "EleutherAI/pythia-410m" , "EleutherAI/pythia-1b", "roneneldan/TinyStories-1M", "roneneldan/TinyStories-3M", "roneneldan/TinyStories-8M", "roneneldan/TinyStories-33M"] {allow-input: true}
#@markdown Enter the model name for the output (no spaces; use "-" instead)
OUTPUT_NAME = "MikaGPT" #@param {type: "string"}
#@markdown Enter the number of training steps between saved checkpoints. Enter -1 if you want to save only the last checkpoint
SAVE_STEPS = 100 #@param {type: "integer"}
#@markdown Enter the batch size per device. The batch size is the number of samples given to the model in one iteration. A larger batch size reduces gradient variance but uses more GPU memory; a smaller one uses less memory but makes the gradients noisier, which can make convergence less stable
BATCH_SIZE = 11 #@param {type:"slider", min:1, max:64, step:1}
#@markdown Enter the number of epochs for the fine-tuning process. One epoch is one full pass over the training dataset. The more epochs, the longer the process takes
TRAIN_EPOCHS = 3 #@param {type:"slider", min:1, max:100, step:1}
#@markdown Select if you want to continue model training from the checkpoint
RESUME_FROM_CHECKPOINT = False #@param {type:"boolean"}
#@markdown Enter additional parameters, leave blank if there are no additional parameters. For documentation please run cell 0.5 or 1.6
ADDITIONAL_PARAMETERS = "" #@param {type: "string"}

if RESUME_FROM_CHECKPOINT:
  ADDITIONAL_PARAMETERS += f" --resume_from_checkpoint {MODEL}"

!pip install -r requirements.txt
!python run_clm.py --model_name_or_path {MODEL} --train_file train.txt --validation_file valid.txt --per_device_train_batch_size {BATCH_SIZE} --per_device_eval_batch_size {BATCH_SIZE} --do_train --do_eval --output_dir /content/{OUTPUT_NAME} --overwrite_output_dir --save_steps={SAVE_STEPS} --num_train_epochs {TRAIN_EPOCHS} --logging_steps 100 --save_total_limit 2 {ADDITIONAL_PARAMETERS}

print("[INFO] Success Fine Tuning Model")

In [None]:
#@title 1.5 Save data to Drive
#@markdown Enter the checkpoint folder name (example: checkpoint-10000). Enter "-" if you only want to save the last checkpoint
STEP = "-" #@param {type: "string"}

import os

%cd /content
!mkdir -p /content/Drive/MyDrive/{OUTPUT_NAME}

# Copy from the chosen checkpoint folder, or from the output root for the final model
SOURCE = f"{OUTPUT_NAME}/{STEP}" if STEP != "-" else OUTPUT_NAME
# Some files may be absent depending on the tokenizer and the Transformers version
# (recent versions save model.safetensors instead of pytorch_model.bin), so skip missing ones
for name in ["config.json", "generation_config.json", "merges.txt", "pytorch_model.bin", "model.safetensors", "tokenizer.json", "vocab.json"]:
  if os.path.exists(f"{SOURCE}/{name}"):
    !cp {SOURCE}/{name} /content/Drive/MyDrive/{OUTPUT_NAME}

print("[INFO] Success saving data to Google Drive")

In [None]:
#@title 1.6 Show help for additional parameters
!pip install -r requirements.txt
!python run_clm.py --help

# **2.0 Inference**
Inference is the process of using a previously trained machine learning model to make predictions or perform tasks on new data. During inference, the model takes new input and produces output based on what it learned during training. The main goal is to apply trained models to real-world problems and produce useful results on inputs the model has not seen before.
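
For text generation, inference is autoregressive: the model repeatedly predicts a next token and appends it to the text so far. The toy sketch below imitates that loop with a hand-written lookup table instead of a neural network. `generate` and `toy_model` are hypothetical names and made-up data, and greedy selection is only one possible decoding strategy. As with the `max_length` setting in the inference cell, the limit here counts the prompt plus the generated tokens.

```python
def generate(prompt_tokens, next_token, max_length):
    """Greedily extend a non-empty prompt until max_length tokens or no continuation."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        candidates = next_token.get(tokens[-1])
        if not candidates:
            break  # the toy model knows no continuation for this token
        # Greedy decoding: pick the most probable next token.
        tokens.append(max(candidates, key=candidates.get))
    return tokens

# Hypothetical "model": probability of the next word given the previous word.
toy_model = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 0.9, "up": 0.1},
}

print(" ".join(generate(["the"], toy_model, max_length=3)))
print(" ".join(generate(["the"], toy_model, max_length=10)))
```

The first call stops because it reaches the length limit; the second stops early because the table has no entry for the last word, loosely analogous to a real model emitting an end-of-sequence token.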

In [None]:
#@title 2.1 Mount Drive

import os
from google.colab import drive

#os.mkdir("Drive")
drive.mount("Drive")

print("[INFO] Success mount drive at /content/Drive")

In [None]:
#@title 2.2 Install Transformers 4.30.0
#@markdown Install Transformers for inference
!pip install transformers==4.30.0
print("[INFO] Please restart runtime to apply changes ")

In [None]:
#@title 2.3 Inference model
#@markdown Enter the path to the folder containing the model, or select one of the available models
MODEL = "openai-gpt" #@param ["openai-gpt", "distilgpt2", "gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl", "facebook/opt-125m", "facebook/opt-350m", "facebook/opt-2.7b", "facebook/xglm-564M", "facebook/xglm-7.5B", "EleutherAI/gpt-neo-125m", "EleutherAI/gpt-neo-1.3B", "EleutherAI/gpt-j-6b", "EleutherAI/pythia-14m", "EleutherAI/pythia-31m", "EleutherAI/pythia-70m", "EleutherAI/pythia-160m", "EleutherAI/pythia-410m" , "EleutherAI/pythia-1b", "roneneldan/TinyStories-1M", "roneneldan/TinyStories-3M", "roneneldan/TinyStories-8M", "roneneldan/TinyStories-33M"] {allow-input: true}
#@markdown Enter the prompt text that you want the model to continue
INPUT = "" #@param {type: "string"}
#@markdown Enter the maximum length of the generated text in tokens (the prompt counts toward this limit)
MAX_LENGTH = 200 #@param {type:"slider", min:25, max:1000, step:1}

from transformers import pipeline
model = pipeline("text-generation", model=MODEL)
teks = model(INPUT, max_length=MAX_LENGTH)
print(teks[0]["generated_text"])