# **Project Name - IndustryGPT: Specialized LLM Bot Using Pre-Trained Models**

**Deep Learning for NLP**

Project Type - LLM Bot

Contribution - Individual

Student Name - Manasvi Save

# **Project Summary -**

This project builds an industry-specific chatbot for construction and real estate by fine-tuning a pre-trained Large Language Model (LLM) from Hugging Face. The base model, `mistralai/Mistral-7B-Instruct-v0.3`, is loaded in 4-bit precision and fine-tuned with LoRA adapters on domain-relevant instruction-response pairs so that it understands industry-specific queries and generates meaningful responses.

Training runs on Google Colab with a T4 GPU (up to 25 epochs; the run recorded in this notebook uses 10), which keeps the workflow lightweight and accessible. The goal is an intelligent conversational bot that answers user questions accurately, built while gaining hands-on experience with real-world data and model fine-tuning.

# **GitHub Link -**

https://github.com/msave121/Speciallised_LLM_Construction_and_Real_Estate_Chatbot_Using_PreTrained_Models

#### 1. Installing Required Libraries

In [None]:
!pip -q install -U transformers datasets accelerate peft bitsandbytes sentencepiece

#### 2. Loading and Preparing the Dataset

In [None]:
from datasets import load_dataset

ds = load_dataset("json", data_files="propchk_train_500.jsonl")["train"]
ds = ds.train_test_split(test_size=0.05, seed=42)
train_ds = ds["train"]

def to_text(ex):
    return {"text": f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['response']}"}

train_ds = train_ds.map(to_text)


Map:   0%|          | 0/475 [00:00<?, ? examples/s]
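
To make the prompt template concrete, the sketch below applies the same formatting to one record. The field names `instruction` and `response` come from the cell above; the record text itself is hypothetical and invented only for illustration.

```python
import json

def to_text(ex):
    # Same template as the notebook cell: instruction block, blank line, response block
    return {"text": f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['response']}"}

# Hypothetical JSONL line (not taken from propchk_train_500.jsonl)
line = (
    '{"instruction": "Kitchen: Tile crack observed. What should I do?", '
    '"response": "Issue type: Finishing issue"}'
)

record = json.loads(line)            # one JSON object per line
formatted = to_text(record)["text"]  # the string the model is trained on
print(formatted)
```

The test prompt in section 6 follows the same layout: it ends right after `### Response:` so that the model continues with the answer.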

### **Applying LoRA Fine-Tuning**

#### 3. Loading the Base Model in 4-bit and Applying LoRA

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "mistralai/Mistral-7B-Instruct-v0.3"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb,
    device_map="auto"
)

# Turn off the key/value cache and gradient checkpointing for training
model.config.use_cache = False
model.gradient_checkpointing_disable()

# Some PEFT versions enable gradient checkpointing by default; keep it off here too
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False)

# Force the outputs of the input embeddings to require gradients
model.enable_input_require_grads()

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"]
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()


trainable params: 20,971,520 || all params: 7,268,995,072 || trainable%: 0.2885
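
The trainable-parameter count reported above can be reproduced by hand. The sketch below assumes the layer dimensions published in the Mistral-7B model configuration (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 decoder layers); these values are not printed anywhere in this notebook, so treat them as assumptions. A LoRA adapter of rank `r` on a linear layer adds two matrices with `r * (in_features + out_features)` weights in total.

```python
# Assumed Mistral-7B dimensions (from the published model config, not from this notebook)
HIDDEN = 4096
KV_DIM = 1024          # 8 key/value heads x 128 dims per head
INTERMEDIATE = 14336
NUM_LAYERS = 32
R = 8                  # LoRA rank, as set in lora_cfg

# (in_features, out_features) of every targeted linear layer in one decoder block
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, r):
    # LoRA adds A (r x in_features) and B (out_features x r)
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in shapes.values())
trainable = per_layer * NUM_LAYERS
all_params = 7_268_995_072  # "all params" value from the cell output

print(trainable)                                # 20971520
print(round(100 * trainable / all_params, 4))   # 0.2885
```

Under these assumptions, the result matches the figure printed by `print_trainable_parameters()`.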


#### 4. Tokenizing the Dataset

In [None]:
MAX_LEN = 512

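def tok(ex):
    # Truncate or pad every example to exactly MAX_LEN tokens
    enc = tokenizer(ex["text"], truncation=True, max_length=MAX_LEN, padding="max_length")
    # Causal LM labels mirror the inputs; padded positions are included in the loss
    enc["labels"] = enc["input_ids"].copy()
    return enc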

train_tok = train_ds.map(tok, remove_columns=train_ds.column_names)


Map:   0%|          | 0/475 [00:00<?, ? examples/s]
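
Because `pad_token` is set to `eos_token` and every example is padded to `MAX_LEN`, copying `input_ids` into `labels` means the padded positions also count toward the training loss. A common alternative, which is not what this notebook does, is to set the label to `-100` wherever the attention mask is 0, since that value is ignored by the cross-entropy loss. The helper name `mask_padding` and the toy token ids below are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(input_ids, attention_mask):
    # Keep the token id where the mask is 1, ignore the position where it is 0
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

# Toy example: three real tokens followed by three pad tokens (ids are made up)
input_ids = [1, 733, 16289, 2, 2, 2]
attention_mask = [1, 1, 1, 0, 0, 0]

labels = mask_padding(input_ids, attention_mask)
print(labels)  # [1, 733, 16289, -100, -100, -100]
```

Inside `tok`, this would replace the `.copy()` line with `enc["labels"] = mask_padding(enc["input_ids"], enc["attention_mask"])`. The loss values recorded in this notebook come from the unmasked version.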

#### 5. Training the Model

In [None]:
from transformers import TrainingArguments, Trainer, default_data_collator

args = TrainingArguments(
    output_dir="propchk_mistral_lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=10,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_steps=200,
    save_total_limit=1,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    data_collator=default_data_collator
)

trainer.train()


| Step | Training Loss |
|------|---------------|
| 10   | 1.665         |
| 20   | 0.0887        |
| 30   | 0.0499        |
| 40   | 0.0419        |
| 50   | 0.0354        |
| 60   | 0.0346        |
| 70   | 0.0345        |
| 80   | 0.0347        |
| 90   | 0.0339        |
| 100  | 0.0334        |


TrainOutput(global_step=1190, training_loss=0.04744420404694662, metrics={'train_runtime': 6553.2498, 'train_samples_per_second': 0.725, 'train_steps_per_second': 0.182, 'total_flos': 1.04110671003648e+17, 'train_loss': 0.04744420404694662, 'epoch': 10.0})
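
The step count and throughput in `TrainOutput` follow from the training arguments. The sketch below is a quick check of that arithmetic, assuming a single GPU, the 475 training examples shown in the `Map` progress output, and that a partial final accumulation group still counts as one optimizer step (which is consistent with the reported `global_step`).

```python
import math

n_train = 475            # training examples after the 5% hold-out
per_device_batch = 1     # per_device_train_batch_size
grad_accum = 4           # gradient_accumulation_steps
epochs = 10              # num_train_epochs
train_runtime = 6553.2498  # seconds, from TrainOutput

effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(n_train / effective_batch)  # assumption: partial group = 1 step
total_steps = steps_per_epoch * epochs

samples_per_second = n_train * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps)                    # 1190
print(round(samples_per_second, 3))   # 0.725
print(round(steps_per_second, 3))     # 0.182
```

At these rates the full run takes roughly 109 minutes.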

#### 6. Testing the Model

In [None]:
prompt = """### Instruction:
Living Room: Flooring hollowness observed. What does it mean and what should I do?

### Response:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=220,
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
    repetition_penalty=1.2,
    no_repeat_ngram_size=4,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id
)

print(tokenizer.decode(out[0], skip_special_tokens=True))


### Instruction:
Living Room: Flooring hollowness observed. What does it mean and what should I do?

### Response:
Issue type: Safety issue
Severity: Major
How to inspect: Standard PropChk inspection procedure.
Recommendation: Rectify as per specification before handover.
Safety instruction: Ensure that safety measures are in place during rectification work.
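
The decoded output repeats the prompt and then lists the answer as `Key: value` lines. For downstream use, a small post-processing step can drop the prompt and turn the answer into a dictionary. The function below is a hypothetical helper, not part of the notebook, and it assumes the model keeps producing one field per line as in the sample above.

```python
def parse_response(decoded):
    # Keep only the text after the response marker, discarding the echoed prompt
    _, _, answer = decoded.partition("### Response:")
    fields = {}
    for line in answer.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that do not look like "Key: value"
            fields[key.strip()] = value.strip()
    return fields

decoded = """### Instruction:
Living Room: Flooring hollowness observed. What does it mean and what should I do?

### Response:
Issue type: Safety issue
Severity: Major
How to inspect: Standard PropChk inspection procedure.
Recommendation: Rectify as per specification before handover.
Safety instruction: Ensure that safety measures are in place during rectification work.
"""

fields = parse_response(decoded)
print(fields["Issue type"])  # Safety issue
print(fields["Severity"])    # Major
```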


In [None]:
model.print_trainable_parameters()


trainable params: 20,971,520 || all params: 7,268,995,072 || trainable%: 0.2885


#### 7. Saving the Model

In [None]:
model.save_pretrained("propchk_mistral_lora")
tokenizer.save_pretrained("propchk_mistral_lora")


('propchk_mistral_lora/tokenizer_config.json',
 'propchk_mistral_lora/special_tokens_map.json',
 'propchk_mistral_lora/chat_template.jinja',
 'propchk_mistral_lora/tokenizer.model',
 'propchk_mistral_lora/added_tokens.json',
 'propchk_mistral_lora/tokenizer.json')
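
Calling `save_pretrained` on a PEFT model stores the LoRA adapter weights rather than a full copy of the base model, so a later session has to load the base model first and attach the adapter to it. The sketch below shows one way to do that. The function name `load_finetuned` is hypothetical, and the quantization settings simply repeat the ones used for training.

```python
def load_finetuned(adapter_dir="propchk_mistral_lora",
                   base_model_name="mistralai/Mistral-7B-Instruct-v0.3"):
    # Heavy imports live inside the function so that defining it stays cheap
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
    from peft import PeftModel

    # Same 4-bit settings as in the training cell
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )
    base = AutoModelForCausalLM.from_pretrained(
        base_model_name, quantization_config=bnb, device_map="auto"
    )
    # Attach the saved LoRA adapter on top of the quantized base model
    model = PeftModel.from_pretrained(base, adapter_dir)
    model.eval()

    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
    return model, tokenizer
```

Usage would be `model, tokenizer = load_finetuned()`, followed by the same `generate` call as in section 6. This needs a GPU runtime and downloads the base model again if it is not cached.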

In [None]:
!ls -lh propchk_mistral_lora
print(model.__class__)