<a href="https://colab.research.google.com/github/nisha432/llm-project-/blob/main/llm_project_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - Large Language Model (LLM)
##### **Contribution**    - Team
##### **Team Member 1 -**
##### **Team Member 2 -**


# **Project Summary -**

In this capstone project, we developed an industry-specific Large Language Model (LLM) bot by building on pre-trained models available through platforms such as Hugging Face. The main objective was to build a bot capable of engaging with users in real time, providing accurate answers, and offering insights specific to a chosen industry. We selected the healthcare field.

We began by researching the specific challenges, common user queries, relevant terminology, and regulatory constraints of our chosen industry. Each industry has its own specialized knowledge and language, so understanding these nuances was essential for developing a bot that could communicate effectively with users in that field. A healthcare bot, for example, needs to handle medical terminology. This foundational understanding guided the customization of our LLM for the specific context.

Once we had a clear grasp of the industry’s requirements, we moved on to fine-tuning a pre-trained LLM. Hugging Face offered access to numerous state-of-the-art models that had been trained on vast amounts of general knowledge. However, these models needed to be adapted for the specific industry in which we were working. Fine-tuning involved adjusting the pre-trained model by training it further on a specialized dataset that contained domain-specific knowledge. This process ensured that the model was not only proficient in general conversation but also able to handle technical terms, provide accurate answers, and respond appropriately to the unique demands of the industry.

We compiled relevant data sources to train the model. This allowed the LLM to understand both the vocabulary and the subtleties of the field.

In addition to fine-tuning the LLM, we also designed and implemented a user-friendly interface that enabled seamless interaction with the bot. This involved creating a chat system through which users could engage with the bot and ask questions. The interface had to be intuitive and responsive, allowing users to receive quick and accurate answers. We also ensured that the LLM bot could handle multiple types of queries, from simple questions to more complex, context-sensitive inquiries.

Testing and evaluation were critical parts of the project. We assessed the bot’s performance by running a series of tests to ensure its accuracy, relevance, and ability to handle a variety of user inputs. During this phase, we refined the bot by fine-tuning its responses, improving its ability to deal with ambiguity, and adjusting its tone and style to make it more natural and user-friendly. It was essential that the bot delivered not only precise information but also responded in a way that felt conversational and approachable, as users were more likely to engage with it if it felt like a trusted source of information.

The final outcome of the project was a fully functional Industry-Specific LLM Bot capable of engaging with users, answering their questions, and providing insights relevant to the industry it served. This project showcased our ability to work with advanced Natural Language Processing (NLP) models and demonstrated our understanding of industry-specific challenges. We successfully created a practical AI solution that met real-world needs.

Through this capstone project, we gained hands-on experience in working with some of the most advanced AI tools available, deepened our knowledge of NLP, and developed a sophisticated product with tangible applications in business and other sectors. Moreover, the process allowed us to apply AI to real-world scenarios, reinforcing the value of AI-driven solutions in addressing industry-specific problems and enhancing decision-making and user interaction within any chosen field.





# **GitHub Link -**

https://github.com/nisha432/llm-project-

# **Problem Statement**


The rapid growth of artificial intelligence has highlighted the need for specialized AI solutions tailored to specific industries. While general-purpose chatbots and large language models (LLMs) exist, they often lack the domain-specific knowledge required to address the unique challenges and terminology of industries like healthcare, finance, and education. This project aims to develop an industry-specific LLM bot that can understand and respond to user inquiries with accuracy and relevance. By fine-tuning pre-trained models with domain-specific data, the bot will provide intelligent, context-aware responses, improving user engagement and decision-making in various professional sectors.

In [1]:
!pip install unsloth -qqq

!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" -qqq


In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


Installing dependencies. In the installation cell, the first line installs unsloth from PyPI. The second line uninstalls it and reinstalls it directly from the GitHub repository with the colab-new extras for Google Colab. The --upgrade and --no-cache-dir flags make pip fetch the current repository code instead of reusing a cached package.

In [4]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-v0.3", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,

)

==((====))==  Unsloth 2024.11.9: Fast Mistral patching. Transformers = 4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [5]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [6]:
# Import necessary libraries
from transformers import TextStreamer
from unsloth import FastLanguageModel

In [7]:
# Prepare your model (assuming it's loaded as `model`)
FastLanguageModel.for_inference(model)  # Enable 2x faster inference

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32768, 4096, padding_idx=770)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): M

In [8]:
# Properly format the instruction, input, and response placeholders in alpaca_prompt
instruction = "Answer this question truthfully"
input_context="What are the symptoms of Allergy?"
response_placeholder = ""  # Since you're generating, the response will be empty

In [9]:
# Format the prompt using the above instruction, input, and placeholder for response
formatted_prompt = alpaca_prompt.format(instruction, input_context, response_placeholder)

In [10]:
# Tokenize the formatted prompt and prepare it for the model
inputs = tokenizer([formatted_prompt], return_tensors="pt").to("cuda")

In [11]:

# Instantiate a TextStreamer to handle output streaming
text_streamer = TextStreamer(tokenizer)

In [12]:
# Generate a response from the model with a max of 128 new tokens
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128)

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer this question truthfully

### Input:
What are the symptoms of Allergy?

### Response:

Allergy is a condition in which the body's immune system reacts to a substance that is normally harmless. Symptoms of allergy can vary depending on the type of allergen and the individual's sensitivity. Common symptoms include:

- Itchy, watery eyes
- Runny or stuffy nose
- Sneezing
- Coughing
- Wheezing
- Skin rash or hives
- Swelling of the lips, tongue, or throat
- Difficulty breathing

If you are experiencing any of these symptoms, it is important to


1. from transformers import TextStreamer: Imports TextStreamer from the transformers library. This class prints the decoded output as tokens are generated, giving real-time feedback during inference.

2. from unsloth import FastLanguageModel: Imports FastLanguageModel from the unsloth library, which provides fast loading and inference for pre-trained models.

3. FastLanguageModel.for_inference(model): Switches the loaded model into Unsloth's inference mode, which the library advertises as about 2x faster. The call also returns the model, which is why the notebook prints the model architecture after this cell.

4. instruction = "Answer this question truthfully": Defines the instruction, a guiding sentence that tells the model how to respond. In this case, the model is instructed to answer questions truthfully.

5. input_context = "What are the symptoms of Allergy?": Defines the input, which is the user's question that the model should answer.

6. response_placeholder = "": An empty string for the response slot of the template. It stays empty, because the model's generated text continues from the end of the prompt, immediately after "### Response:".

7. formatted_prompt = alpaca_prompt.format(instruction, input_context, response_placeholder): alpaca_prompt is the Alpaca-style template defined in the earlier cell. This step fills its three {} placeholders, in order, with the instruction, the input context, and the empty response placeholder. The model uses the resulting prompt to generate a response.

8. inputs = tokenizer([formatted_prompt], return_tensors="pt").to("cuda"): Tokenizes the formatted prompt with the model's tokenizer, which breaks the prompt into token IDs that the model can process. return_tensors="pt": Returns the token IDs as PyTorch tensors. .to("cuda"): Moves the tensors to the GPU. This call assumes a CUDA device is present and raises an error otherwise.

9. text_streamer = TextStreamer(tokenizer): Initializes the TextStreamer, which displays each token as the model generates it instead of waiting for the full response.

10. _ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128): model.generate() produces a response from the tokenized input. **inputs: Passes the tokenized prompt to the model. streamer=text_streamer: Streams the output through the TextStreamer as tokens are generated. max_new_tokens=128: Caps the response at 128 new tokens, which is why the sample output above stops mid-sentence.





In [13]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.11.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


1. model = FastLanguageModel.get_peft_model(...): Wraps the loaded model for PEFT (Parameter-Efficient Fine-Tuning). The technique used here is LoRA (Low-Rank Adaptation): the base weights stay frozen and only small low-rank adapter matrices are trained, which saves memory and speeds up training. The log line above confirms that all 32 decoder layers were patched. The parameters of get_peft_model() are:

2. model: The base model that PEFT is applied to, which has already been loaded.

3. r = 16: The rank of the LoRA adapter matrices. The code comment suggests 8, 16, 32, 64, or 128. Larger ranks add more trainable parameters and more capacity to adapt, while smaller ranks reduce memory consumption.

4. target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]: The layers that receive LoRA adapters. q_proj, k_proj, v_proj: The Query, Key, and Value projection layers in the attention mechanism. o_proj: The output projection layer in the attention mechanism. gate_proj, up_proj, down_proj: The layers of the feed-forward network. Together these cover every linear projection in each decoder layer, so both the attention and the feed-forward blocks are adapted.

5. lora_alpha = 16: The LoRA scaling factor. In standard LoRA the adapter update is multiplied by lora_alpha / r, so with r = 16 the scale here is 1. Raising or lowering lora_alpha relative to r strengthens or weakens the adapter's influence on the frozen weights.

6. lora_dropout = 0: The dropout rate for the LoRA layers. 0 means no dropout, which the code comment marks as the optimized setting. A nonzero value adds regularization if overfitting is a concern.

7. bias = "none": Controls which bias terms are trained. "none" means no bias parameters are updated, which the code comment marks as the optimized setting. The alternatives are "all" and "lora_only".

8. use_gradient_checkpointing = "unsloth": Gradient checkpointing saves memory by discarding most intermediate activations during the forward pass and recomputing them during the backward pass, at the cost of extra computation. "unsloth" selects Unsloth's own implementation, which according to the code comment uses 30% less VRAM and fits 2x larger batch sizes. Setting it to True enables standard gradient checkpointing.

9. random_state = 3407: Sets the seed for random number generation when the adapters are created. A fixed seed makes runs repeatable as far as the hardware and library versions allow.

10. use_rslora = False: rsLoRA (rank-stabilized LoRA) scales the adapter update by lora_alpha / sqrt(r) instead of lora_alpha / r, which keeps training stable at higher ranks. It is disabled here.

11. loftq_config = None: LoftQ (LoRA-Fine-Tuning-aware Quantization) initializes the LoRA matrices to compensate for the error introduced by quantizing the base weights. None means that LoftQ is not applied here.
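The effect of r and target_modules on model size can be checked with a little arithmetic. The sketch below is not part of the notebook; it assumes the standard LoRA layout, in which each adapted linear layer gets an A matrix of shape (r, in_features) and a B matrix of shape (out_features, r), and it takes the layer shapes from the architecture printed earlier. The names PROJECTION_SHAPES and lora_param_count are hypothetical.

```python
# Layer shapes (in_features, out_features) read from the printed Mistral architecture.
PROJECTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32   # decoder layers (0-31)
RANK = 16         # r
LORA_ALPHA = 16   # lora_alpha


def lora_param_count(shapes, rank, num_layers):
    """Count LoRA parameters: rank * (in + out) per adapted layer, per decoder layer."""
    per_decoder_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_decoder_layer * num_layers


total = lora_param_count(PROJECTION_SHAPES, RANK, NUM_LAYERS)
scale = LORA_ALPHA / RANK  # standard LoRA scaling factor

print(f"Trainable LoRA parameters: {total:,}")  # 41,943,040
print(f"Adapter scale (alpha / r): {scale}")    # 1.0
```

The total agrees with the "Number of trainable parameters" figure that Unsloth reports when training starts.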

In [14]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [15]:
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    """Format a batch of examples into Alpaca-style training texts."""
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    # input_text is used instead of input to avoid shadowing the built-in input()
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

In [16]:
from datasets import load_dataset
dataset1 = load_dataset("medalpaca/medical_meadow_wikidoc_patient_information", split = "train")
dataset = dataset1.map(formatting_prompts_func, batched = True,)


In [17]:
print(dataset1)

Dataset({
    features: ['input', 'output', 'instruction'],
    num_rows: 5942
})



1. alpaca_prompt (Defining the Prompt Template): Defines a prompt template with three sections. Instruction: Describes a task or asks a question. Input: Provides additional context or specific details for the task. Response: The answer that addresses the instruction using the input. The three {} placeholders are filled, in order, with the instruction, input, and response during formatting.

2. EOS_TOKEN = tokenizer.eos_token: Retrieves the end-of-sequence token from the tokenizer. Appending it to every training example teaches the model to emit it when an answer is complete. Without it, the fine-tuned model may keep generating until it reaches the token limit.

3. Defining the Function formatting_prompts_func(): Purpose: This function formats a batch of dataset examples using the alpaca_prompt template. Steps: The instruction, input, and output columns are extracted from the batch. zip(): Loops through the corresponding items from the three columns. alpaca_prompt.format(): Fills the three placeholders in the template with each example's instruction, input, and output, then appends the EOS_TOKEN. texts.append(text): Stores each formatted text (prompt + EOS_TOKEN) in a list. Finally, the function returns the formatted texts as a dictionary with the key "text".

4. Loading and Formatting the Dataset: load_dataset("medalpaca/medical_meadow_wikidoc_patient_information", split = "train"): Loads the medical_meadow_wikidoc_patient_information dataset from the medalpaca collection. As the printed summary shows, it has 5,942 rows with the fields input, output, and instruction. The split = "train" argument specifies that the training portion of the dataset is loaded. dataset1.map(formatting_prompts_func, batched = True): map() applies formatting_prompts_func() to the dataset and adds the returned "text" column; the result is stored in dataset. batched = True passes the examples to the function in batches rather than one at a time, which is faster.
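To see what the formatting step produces without downloading the dataset or loading a tokenizer, the function can be run on a hand-made batch. This is a minimal sketch: the "</s>" string is an assumed stand-in for tokenizer.eos_token, and the two rows in toy_batch are made-up placeholders, not rows from the real dataset.

```python
EOS_TOKEN = "</s>"  # assumption: stand-in for tokenizer.eos_token

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


def formatting_prompts_func(examples):
    """Turn a batch (dict of columns) into a dict with one formatted 'text' column."""
    texts = [
        alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        for instruction, input_text, output in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"text": texts}


# Made-up batch in the same column layout that map(batched=True) passes in.
toy_batch = {
    "instruction": ["Answer this question truthfully"] * 2,
    "input": ["What is a fever?", "What is a cough?"],
    "output": ["(toy answer 1)", "(toy answer 2)"],
}

formatted = formatting_prompts_func(toy_batch)
print(formatted["text"][0])
```

With batched = True, map() hands the function a dictionary of lists like toy_batch, which is why the function loops over the columns with zip() instead of receiving one row at a time.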


In [18]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported


In [19]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 25, # Set num_train_epochs = 1 for full training runs
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

max_steps is given, it will override any value given in num_train_epochs


The import cell brings in three names. SFTTrainer: A trainer class from the TRL (Transformer Reinforcement Learning) library designed for Supervised Fine-Tuning (SFT) of models. TrainingArguments: Defines the training configuration, such as batch size, learning rate, and number of steps. is_bfloat16_supported: A utility function from unsloth that checks whether bfloat16 precision is supported on the current hardware.

1. Instantiating the Trainer: SFTTrainer

- model = model: The model to fine-tune.
- tokenizer = tokenizer: The tokenizer that belongs to the model, used to tokenize the training text.
- train_dataset = dataset: The training dataset, loaded and formatted earlier.
- dataset_text_field = "text": The dataset field that holds the formatted text produced by formatting_prompts_func().
- max_seq_length = max_seq_length: The maximum sequence length in tokens. Longer examples are truncated, which bounds memory use.
- dataset_num_proc = 2: The number of parallel processes used for dataset preprocessing.
- packing = False: Disables packing. Packing concatenates several short examples into one sequence of up to max_seq_length tokens, which the code comment says can make training up to 5x faster for short sequences.

2. Training Arguments: TrainingArguments()

- per_device_train_batch_size = 2: The number of training examples processed per device (e.g., GPU) in one forward and backward pass. A small value keeps memory usage low.
- gradient_accumulation_steps = 4: Gradients from 4 consecutive batches are accumulated before each optimizer update. The effective batch size is therefore 2 x 4 = 8, which helps when memory is constrained.
- warmup_steps = 5: The number of steps over which the learning rate rises gradually from zero to its maximum value.
- max_steps = 25: The maximum number of optimizer steps. This is a short run: at 8 examples per step it covers 200 of the 5,942 examples. For full training, use a larger value or set num_train_epochs = 1 instead. As the warning above states, max_steps overrides num_train_epochs when both are given.
- learning_rate = 2e-4: The peak learning rate (0.0002), which determines how large each weight update is.
- fp16 = not is_bfloat16_supported(): Enables 16-bit floating point (fp16) training when bfloat16 is not supported. This reduces memory usage and speeds up training.
- bf16 = is_bfloat16_supported(): Enables bfloat16 when the hardware supports it (e.g., Ampere or newer GPUs). bfloat16 has the same exponent range as 32-bit floats, so it is less prone to overflow and underflow than fp16. The Tesla T4 used here reports Bfloat16 = FALSE, so this run uses fp16.
- logging_steps = 1: Logs the training loss after every step, allowing close monitoring.
- optim = "adamw_8bit": Uses the 8-bit AdamW optimizer, which stores the optimizer states in 8 bits and so needs less memory than standard AdamW.
- weight_decay = 0.01: Applies weight decay with a coefficient of 0.01, which regularizes the model by shrinking the trained weights over time.
- lr_scheduler_type = "linear": After the warmup period, the learning rate decreases linearly to zero at the final step.
- seed = 3407: A fixed seed for reproducibility. Using the same seed makes runs repeatable as far as the hardware and library versions allow.
- output_dir = "outputs": The directory where training outputs (e.g., checkpoints, logs) are saved.
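The interaction between batch size, gradient accumulation, max_steps, and the learning rate schedule can be worked out by hand. The sketch below is an illustration rather than the trainer's own code: linear_lr is a hypothetical function that follows the usual linear warmup and linear decay formula, and the exact value the trainer applies at a given step may be offset by one step.

```python
PER_DEVICE_BATCH = 2     # per_device_train_batch_size
GRAD_ACCUM_STEPS = 4     # gradient_accumulation_steps
NUM_GPUS = 1             # from the training banner
MAX_STEPS = 25           # max_steps
WARMUP_STEPS = 5         # warmup_steps
PEAK_LR = 2e-4           # learning_rate
NUM_EXAMPLES = 5942      # rows in the training split

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS
examples_seen = effective_batch * MAX_STEPS
epoch_fraction = examples_seen / NUM_EXAMPLES


def linear_lr(step):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, MAX_STEPS - step)
    return PEAK_LR * remaining / max(1, MAX_STEPS - WARMUP_STEPS)


print(f"Effective batch size: {effective_batch}")    # 8
print(f"Examples seen in {MAX_STEPS} steps: {examples_seen}")  # 200
print(f"Fraction of one epoch: {epoch_fraction:.1%}")
for step in (0, 1, 5, 15, 25):
    print(f"step {step:>2}: lr = {linear_lr(step):.2e}")
```

The run therefore touches only a small part of the dataset, which is consistent with treating max_steps = 25 as a quick trial rather than a full training run.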



In [20]:

trainer_stats = trainer.train()


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 5,942 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 25
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,1.9114
2,1.9388
3,2.0344
4,1.6
5,1.4743
6,1.2707
7,1.2156
8,1.3296
9,1.1359
10,1.0791


In [21]:
model.save_pretrained("PROJECT_model") # Local saving
tokenizer.save_pretrained("PROJECT_model")

('PROJECT_model/tokenizer_config.json',
 'PROJECT_model/special_tokens_map.json',
 'PROJECT_model/tokenizer.model',
 'PROJECT_model/added_tokens.json',
 'PROJECT_model/tokenizer.json')

In [22]:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/content/PROJECT_model", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

==((====))==  Unsloth 2024.11.9: Fast Mistral patching. Transformers = 4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: Will load /content/PROJECT_model as a legacy tokenizer.


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32768, 4096, padding_idx=770)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj):

In [23]:
# Define the input prompt
formatted_prompt = alpaca_prompt.format(
    "Answer this question truthfully",  # instruction
    "What are the symptoms of Allergy?",  # input
    ""  # output - leave this blank for generation!
)



In [24]:
# Tokenize the prompt
inputs = tokenizer(
    [formatted_prompt],
    return_tensors="pt"
).to("cuda")


In [25]:
# Generate the response with the model
outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)


In [26]:
# Decode the output into text
decoded_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)

In [27]:
# Print the generated response in a clearer format
print("Generated Response: \n")
print(decoded_output[0])


Generated Response: 

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer this question truthfully

### Input:
What are the symptoms of Allergy?

### Response:
Symptoms of an allergic reaction depend on the type of allergen and the person's sensitivity to it.
Symptoms may include:
Sneezing Runny or stuffy nose Itchy, watery eyes Itchy nose, throat, or roof of the mouth Itchy ears Coughing Wheezing Chest tightness Shortness of breath Hives or swelling of the skin (anaphylaxis)
Symptoms of an allergic reaction may be mild or severe. They may occur within minutes of exposure to an allergen or may not appear for several hours.
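
The decoded output repeats the whole prompt before the answer, because generate() returns the prompt tokens followed by the new tokens. For a chat interface, only the part after the response marker is usually wanted. The helper below is a hypothetical sketch of that post-processing step; extract_response and RESPONSE_MARKER are names introduced here, not part of the notebook.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded_text: str) -> str:
    """Return the text after the last response marker, or the whole text if no marker is found."""
    _, found, answer = decoded_text.rpartition(RESPONSE_MARKER)
    return answer.strip() if found else decoded_text.strip()


# Shortened stand-in for decoded_output[0], in the same layout as the output above.
sample_output = (
    "### Instruction:\nAnswer this question truthfully\n\n"
    "### Input:\nWhat are the symptoms of Allergy?\n\n"
    "### Response:\nSymptoms may include sneezing and itchy, watery eyes."
)

print(extract_response(sample_output))
```

In the notebook, the same helper could be applied to decoded_output[0] before printing.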


In [28]:
# Define the input prompt
formatted_prompt = alpaca_prompt.format(
    "Answer this question truthfully",  # instruction
    "What causes Allergy?",  # input
    ""  # output - leave this blank for generation!
)


In [29]:
# Tokenize the prompt
inputs = tokenizer(
    [formatted_prompt],
    return_tensors="pt"
).to("cuda")


In [30]:
# Generate the response with the model
outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)

In [31]:
# Decode the output into text
decoded_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)

In [32]:
# Print the generated response in a clearer format
print("Generated Response: \n")
print(decoded_output[0])


Generated Response: 

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer this question truthfully

### Input:
What causes Allergy?

### Response:
Allergies are caused by a reaction to a foreign substance. The immune system normally protects the body from disease. In people with allergies, the immune system reacts to a foreign substance as though it were harmful. This substance is called an allergen.
Allergens are usually harmless substances that do not cause a reaction in most people. When a person with allergies comes into contact with an allergen, the immune system responds by making antibodies. These antibodies cause the release of chemicals that lead to allergy symptoms.
Allergens can be found in the air, on the skin, or in food.
Allergens in the air include:
Pollen from trees, grass, and weeds Dust mites Mold spores Animal dander (skin flakes)
Allerge

# **Conclusion**

In conclusion, this capstone project gave us the opportunity to create an industry-specific Large Language Model (LLM) bot for the healthcare sector, using pre-trained models available through Hugging Face. We applied our knowledge of machine learning, natural language processing (NLP), and healthcare-specific insights to develop a system capable of answering medical questions. The process involved selecting a model architecture, fine-tuning it with relevant healthcare data, and checking that it could provide accurate and context-aware responses for patients, medical professionals, and healthcare providers.

Throughout this work, we deepened our understanding of how AI technologies, particularly in the field of NLP, are being applied in the healthcare industry. AI-driven bots can improve user interactions, automate routine medical inquiries, and provide healthcare information. We also gained hands-on experience with tools and platforms commonly used in AI, such as Hugging Face, and learned how to adapt pre-trained models to a healthcare application.

Completing this project strengthened our technical expertise and helped us appreciate the broader implications of conversational AI in healthcare. We built skills for addressing industry-specific challenges that can contribute to improving patient care and streamlining administrative tasks. AI-powered solutions of this kind can help make healthcare more efficient, accessible, and user-friendly.

Ultimately, this capstone project serves as a stepping stone in our journey as AI practitioners, equipping us to design and deploy intelligent systems that positively impact healthcare delivery. By working with conversational AI technologies, we are better prepared to create solutions that improve patient experiences, support healthcare professionals, and drive innovation in the healthcare industry.



