## Setting up the working directory

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
import os
working_directory = '/content/drive/MyDrive/Topic_Modeling'
if os.getcwd() != working_directory:
  os.chdir(working_directory)
os.getcwd()

'/content/drive/MyDrive/Topic_Modeling'
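
Note that `os.chdir` raises `FileNotFoundError` when the target folder does not exist yet, for example on a fresh Drive. A minimal sketch, assuming it is acceptable to create the folder on first use; `enter_working_directory` is a hypothetical helper and not part of the original notebook:

```python
import os

def enter_working_directory(path):
    """Create `path` if it is missing, switch to it, and return the new cwd."""
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    os.chdir(path)
    return os.getcwd()
```

In the notebook this would be called as `enter_working_directory('/content/drive/MyDrive/Topic_Modeling')` after the Drive is mounted.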

## Installing the packages

In [19]:
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes

!pip install wandb

In [5]:
!nvidia-smi

Mon Jun 24 04:28:56 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Model setup

In [6]:
from unsloth import FastLanguageModel
import torch
import wandb

# Wandb integration
wandb.login()
wandb.init(project="Dynamic_topic_generation_Llama3")

max_seq_length = 2048
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # 4bit quantization to reduce memory usage.

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: meghanaraobn2020 (meghanaraobn). Use `wandb login --relogin` to force relogin


In [7]:
# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit", # pre-quantized 4-bit Llama 3 model from unsloth
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth: Fast Llama patching release 2024.6
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
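
The banner above reports `CUDA = 7.5` and `Bfloat16 = FALSE`, which is consistent with `dtype = None` resolving to float16 on the Tesla T4. The rule of thumb from the `dtype` comment can be sketched as follows. `pick_dtype` is a hypothetical helper for illustration only, and Unsloth's actual detection logic may differ:

```python
def pick_dtype(compute_capability_major):
    """Return the 16-bit dtype name suited to a GPU generation.

    Assumption: bfloat16 needs compute capability 8.0 (Ampere) or newer;
    older GPUs such as the T4 (7.5) and V100 (7.0) fall back to float16.
    """
    return "bfloat16" if compute_capability_major >= 8 else "float16"

print(pick_dtype(7))  # Tesla T4 / V100 generation
print(pick_dtype(8))  # Ampere generation
```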


In [8]:
# Add LoRA adapters so that only a small fraction of the parameters is trained
# (about 42M here, well under 1% of the 8B base model)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Any, but = 0 is optimized
    bias = "none",    # Any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # Rank stabilized LoRA
    loftq_config = None, # LoftQ
)

Unsloth 2024.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
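
The trainable-parameter count that the trainer reports later (41,943,040) can be reproduced by hand from the LoRA rank and the shapes of the adapted projections. The sketch below assumes the standard Llama 3 8B dimensions, which are not printed anywhere in this notebook: hidden size 4096, MLP size 14336, 8 key/value heads of 128 dimensions each, and 32 layers.

```python
# Assumed Llama 3 8B shapes (not shown in this notebook).
hidden = 4096          # model width
intermediate = 14336   # MLP width
kv_dim = 8 * 128       # grouped-query attention: 8 KV heads x 128 dims
layers = 32
r = 16                 # LoRA rank used in get_peft_model above

# (input_dim, output_dim) of every projection listed in target_modules
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# A rank-r adapter adds two matrices, A (r x in) and B (out x r),
# so each adapted projection contributes r * (in + out) weights.
per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
trainable = per_layer * layers
print(f"{trainable:,}")  # 41,943,040
```

Doubling `r` doubles this count, which is the main cost of choosing a higher rank.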


## Data preparation

In [9]:
# Prompt format
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [10]:
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
# Format prompts for each sample in the dataset.
def format_prompts(dataset):
    instructions = "Please generate a meaningful topic for the following article."
    texts = []
    for abstract, topic in zip(dataset["Abstract"], dataset["Topic"]):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = prompt.format(instructions, abstract, topic) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}
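
To see what a single training example looks like, the formatter can be run on a toy batch without loading the tokenizer. This is a minimal sketch: the `"<eos>"` string is a placeholder for `tokenizer.eos_token`, and the abstract and topic are made-up values for illustration.

```python
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<eos>"  # placeholder; the notebook uses tokenizer.eos_token

def format_prompts(batch):
    instructions = "Please generate a meaningful topic for the following article."
    texts = []
    for abstract, topic in zip(batch["Abstract"], batch["Topic"]):
        # The EOS token marks where the model should stop generating.
        texts.append(prompt.format(instructions, abstract, topic) + EOS_TOKEN)
    return {"text": texts}

# Hypothetical one-row batch in the same column layout as the dataset.
toy_batch = {"Abstract": ["We study X."], "Topic": ["Toy Topic"]}
formatted = format_prompts(toy_batch)["text"][0]
print(formatted)
```

Because `dataset.map` is called with `batched = True`, the function receives columns as lists, which is why it iterates with `zip` and returns a list under the `"text"` key.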

In [11]:
# Load the 'dynamic_topic_modeling_arxiv_abstracts' dataset from Hugging Face and format a prompt for each sample.
from datasets import load_dataset

dataset = load_dataset("ankitagr01/dynamic_topic_modeling_arxiv_abstracts", split = "train")
eval_dataset = load_dataset("ankitagr01/dynamic_topic_modeling_arxiv_abstracts", split="test")
dataset = dataset.map(format_prompts, batched = True,)
eval_dataset = eval_dataset.map(format_prompts, batched = True,)

Generating train split:   0%|          | 0/15000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/15000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [12]:
dataset[0]

{'Abstract': '  The problem of statistical learning is to construct a predictor of a random\nvariable $Y$ as a function of a related random variable $X$ on the basis of an\ni.i.d. training sample from the joint distribution of $(X,Y)$. Allowable\npredictors are drawn from some specified class, and the goal is to approach\nasymptotically the performance (expected loss) of the best predictor in the\nclass. We consider the setting in which one has perfect observation of the\n$X$-part of the sample, while the $Y$-part has to be communicated at some\nfinite bit rate. The encoding of the $Y$-values is allowed to depend on the\n$X$-values. Under suitable regularity conditions on the admissible predictors,\nthe underlying family of probability distributions and the loss function, we\ngive an information-theoretic characterization of achievable predictor\nperformance in terms of conditional distortion-rate functions. The ideas are\nillustrated on the example of nonparametric regression in Gauss

## Model training

In [24]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Hugging Face TRL's SFTTrainer
trainer = SFTTrainer(
    model = model, # The pre-trained model to be fine-tuned
    tokenizer = tokenizer, # The tokenizer associated with the model
    train_dataset = dataset, # Dataset used for training
    eval_dataset=eval_dataset, # Dataset used for evaluation
    dataset_text_field = "text", # Only take 'text' part from the dataset
    max_seq_length = max_seq_length,  # Maximum sequence length for input text
    dataset_num_proc = 2,  # Number of processes for dataset processing
    packing = False,  # If True, packs several short sequences into one long sequence to speed up training
    args = TrainingArguments(
        per_device_train_batch_size = 8, # Batch size for training
        gradient_accumulation_steps = 8,  # Number of steps to accumulate gradients
        warmup_steps = 5, # Number of warmup steps for learning rate scheduler
        num_train_epochs = 1, # Number of epochs for training
        learning_rate = 2e-4,  # Initial learning rate
        fp16 = not is_bfloat16_supported(),  # Use float16 precision when bfloat16 is not supported
        bf16 = is_bfloat16_supported(), # Use bfloat16 precision if supported
        logging_steps = 100,  # Log every 100 steps
        eval_strategy = "steps",
        eval_steps = 200, # Evaluation step interval
        save_steps = 200, # Model save step interval
        optim = "adamw_8bit",  # AdamW optimizer in 8-bit precision to reduce memory usage.
        weight_decay = 0.01, # Weight decay for the optimizer
        lr_scheduler_type = "linear", # Learning rate scheduler
        seed = 3407, # Random seed for reproducibility
        output_dir = "outputs", # Directory to save model checkpoints
        report_to = "wandb", # Reporting tool for logging in wandb
        logging_dir = "./logs", # Directory for logging
        run_name = "Dynamic_topic_generation_Llama3",
    ),
)

In [25]:
# Current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
10.389 GB of memory reserved.


In [26]:
# Model training starts
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 15,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 8
\        /    Total batch size = 64 | Total steps = 234
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss,Validation Loss
200,1.692,1.700544
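
The batch size and step count in the banner follow from the training arguments. The arithmetic below is a sketch that assumes a single GPU and assumes that leftover micro-batches which do not fill a complete accumulation step are dropped at the end of the epoch; it is not taken from the trainer's source.

```python
import math

num_examples = 15_000
per_device_batch = 8   # per_device_train_batch_size
grad_accum = 8         # gradient_accumulation_steps

# One optimizer step consumes batch_size * accumulation_steps examples.
total_batch = per_device_batch * grad_accum

# Micro-batches per epoch, then whole accumulation groups per epoch.
micro_batches = math.ceil(num_examples / per_device_batch)
optimizer_steps = micro_batches // grad_accum

# Fraction of the epoch actually covered by those steps.
epoch_fraction = optimizer_steps * grad_accum / micro_batches

print(total_batch, optimizer_steps, round(epoch_fraction, 4))
```

With `logging_steps = 100` and `eval_steps = 200`, a run of 234 steps reaches only one evaluation point, which is why the results table has a single row at step 200.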


In [27]:
# Save trained model
if True:
  # This only saves the LoRA adapters, and not the full model
  model.save_pretrained("fine_tuned_model") # Local saving
  tokenizer.save_pretrained("fine_tuned_model")

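# Alternatives (disabled): merge the adapters into the base weights
# (merged_16bit for float16, merged_4bit for int4) or save only the adapters (lora)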
if False: model.save_pretrained_merged("fine_tuned_model", tokenizer, save_method = "merged_16bit")
if False: model.save_pretrained_merged("fine_tuned_model", tokenizer, save_method = "merged_4bit")
if False: model.save_pretrained_merged("fine_tuned_model", tokenizer, save_method = "lora")

In [28]:
# Finish W&B run
wandb.finish()

metric,history
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▁▁▁▁▁▁▁▄▇▇█
train/global_step,▁▁▂▂▂▃▃▃▃▄▄▇▇█
train/grad_norm,█▅▄▄▄▅▄▄▄▃▁▁
train/learning_rate,██████████▅▁
train/loss,█▃▂▂▁▁▂▁▂▂▂▂

metric,value
eval/loss,1.70054
eval/runtime,351.8239
eval/samples_per_second,2.842
eval/steps_per_second,0.355
total_flos,2.5851113110614835e+17
train/epoch,0.9984
train/global_step,234.0
train/grad_norm,0.11882
train/learning_rate,3e-05
train/loss,1.692


In [29]:
# Final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

15976.616 seconds used for training.
266.28 minutes used for training.
Peak reserved memory = 11.094 GB.
Peak reserved memory for training = 0.705 GB.
Peak reserved memory % of max memory = 75.224 %.
Peak reserved memory for training % of max memory = 4.78 %.
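
The figures above are simple differences and ratios of three numbers: the memory reserved before training, the peak reserved afterwards, and the card's total memory. A small sketch that reproduces them from the values printed in this notebook; `memory_report` is a hypothetical helper, not part of the original code.

```python
def memory_report(start_gb, peak_gb, total_gb):
    """Derive the training-only memory and both percentages, as in the cell above."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }

# Values printed earlier: 10.389 GB reserved before training,
# 11.094 GB peak, 14.748 GB total on the Tesla T4.
report = memory_report(10.389, 11.094, 14.748)
print(report)

# Training time in minutes, from the reported runtime in seconds.
minutes = round(15976.616 / 60, 2)
print(minutes)
```

Because the baseline is taken from `torch.cuda.max_memory_reserved()` before `trainer.train()`, the "for training" figure measures only the growth in the peak during training, not the memory held by the loaded model.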


## Model inference

In [1]:
from unsloth import FastLanguageModel
import torch
import wandb

max_seq_length = 2048
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # 4bit quantization to reduce memory usage.

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [2]:
# Prompt format
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [8]:
# Load fine-tuned model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "fine_tuned_model", # directory with the LoRA adapters saved after training
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    device_map = "auto",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

==((====))==  Unsloth: Fast Llama patching release 2024.6
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unsloth 2024.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [9]:
def extract_response_text(output_text):
  response_marker = "### Response:"
  start_index = output_text.find(response_marker)
  if start_index == -1:
    return ""

  start_index += len(response_marker)

  # Extract only the response text following the '### Response:', excluding any subsequent text.
  end_index = output_text.find("###", start_index)
  if end_index == -1:
    end_index = len(output_text)

  response_text = output_text[start_index:end_index].strip()
  return response_text
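
The decoded output contains the full prompt followed by the generated text, and the model may keep going with another `###` section if it does not stop at the EOS token. The sketch below repeats the function from the cell above and runs it on a hand-written string (a made-up example, not real model output) to show both the trimming and the fallback when the marker is missing.

```python
def extract_response_text(output_text):
    response_marker = "### Response:"
    start_index = output_text.find(response_marker)
    if start_index == -1:
        return ""  # marker missing: nothing to extract
    start_index += len(response_marker)

    # Stop at the next '###' header, if the model produced one.
    end_index = output_text.find("###", start_index)
    if end_index == -1:
        end_index = len(output_text)
    return output_text[start_index:end_index].strip()

# Hypothetical decoded string with trailing text after the response.
decoded = (
    "### Instruction:\nName the topic.\n\n"
    "### Input:\nSome abstract.\n\n"
    "### Response:\nFingerprint Morphing\n\n"
    "### Instruction:\nextra text"
)
print(extract_response_text(decoded))          # Fingerprint Morphing
print(repr(extract_response_text("no marker")))  # ''
```

One limitation of this approach is that a topic which itself contains `###` would be cut short, which is unlikely for short topic labels.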

In [17]:
instruction = "Please generate a meaningful topic for the following article."
input_text = "Fingerprint morphing is the process of combining two or more distinct fingerprints to create a new, morphed fingerprint that includes identity-related characteristics of all constituent fingerprints. Previously, this was done by either applying a model-based minutiae-oriented approach or a data-driven approach based on a Generative Adversarial Network (GAN). The model-based approach provides the ability to manage the number of minutiae coming from the fingerprints, but the resulting fingerprint often appears unrealistic. On the other hand, the data-driven approach produces realistic fingerprints, but it does not guarantee that the resulting fingerprint matches the original fingerprints. In this work, we introduce an algorithm that combines minutiae-oriented and GAN-based approaches to generate morphed fingerprints that look realistic and match their original fingerprints. The algorithm is initially designed to generate double-identity fingerprints and is further extended to generate triple-identity fingerprints. The results of our experiments indicate that the generated fingerprints appear realistic and the majority of them can be seen as double-identity fingerprints. The fingerprints resulting from morphing three fingerprints are unlikely to be triple-identity fingerprints, but rather anonymous ones matching none of the constituent original fingerprints."
output = ""

In [18]:
# Tokenize inputs
inputs = tokenizer(
    [
        prompt.format(
            instruction,  # instruction
            input_text,   # input
            output,       # output - blank for generation!
        )
    ],
    return_tensors="pt"
).to("cuda")

# Generate the topic (at most 10 new tokens)
output = model.generate(**inputs, max_new_tokens = 10, use_cache = True)
response = extract_response_text(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
print("====================================================================================")
print(f"Generated Topic: {response}")
print("====================================================================================")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Generated Topic: Fingerprint Morphing
