## **Automating Instruction Generation with Mistral 7B**



### **Import Libraries**

In [15]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login
import pandas as pd
import warnings
import torch
import json
import re
import os

### **Setup Hugging Face**

In [2]:
# Add your HF token here (placeholder shown; never commit a real token to a notebook)
hf_token = "<YOUR_HF_TOKEN>"

In [3]:
# Setting Hugging Face with token
login(token=hf_token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [4]:
# Setting Hugging Face Environment Variables
# HF_HOME is the root directory where the Hugging Face libraries cache models and store the login token.
# It should point at a plain directory; the token value itself must not be appended to the path.
# The libraries usually read HF_HOME at import time, so set it before importing them for it to take effect.

os.environ["HF_HOME"] = "/root/.huggingface"


### **Load Mistral 7B Model and Tokenizer**

In [5]:
!pip install -q autoawq


- For documentation, see the model card: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-AWQ?library=true

- This notebook uses the AWQ-quantized version of Mistral 7B Instruct because of limited computing resources.

In [5]:
# Check if a CUDA-capable GPU is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using GPU:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("Using CPU")

Using GPU: Tesla T4


In [6]:
# Load the model and move it to the device
model_name = "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/962 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/757 [00:00<?, ?B/s]

You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.


model.safetensors:   0%|          | 0.00/4.15G [00:00<?, ?B/s]



generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

### **Preprocess Documents**

In [8]:
def preprocess_documents(documents):
    # Preprocess documents as needed for Mistral 7B (e.g., cleaning, tokenization)
    preprocessed_docs = []
    for doc in documents:
        # Remove special characters and extra whitespace
        clean_doc = re.sub(r'[^\w\s]', '', doc)
        clean_doc = re.sub(r'\s+', ' ', clean_doc).strip()
        preprocessed_docs.append(clean_doc)
    return preprocessed_docs
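
To make the effect of the two regular expressions concrete, the sketch below applies the same substitutions to one made-up sample string. The helper name `clean_text` and the sample text are illustrative assumptions, not part of the notebook. Note that removing all punctuation also removes apostrophes and parentheses, which may or may not be desirable for medical text.

```python
import re

def clean_text(doc):
    """Apply the same two substitutions used in preprocess_documents."""
    # Drop every character that is neither a word character nor whitespace
    no_punct = re.sub(r"[^\w\s]", "", doc)
    # Collapse runs of whitespace into single spaces and trim the ends
    return re.sub(r"\s+", " ", no_punct).strip()

sample = "GERD (acid reflux)  isn't   rare."  # made-up example text
cleaned = clean_text(sample)
print(cleaned)
```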

### **Generate Instructions Using Mistral Model**

In [114]:
def generate_instructions(preprocessed_documents, tokenizer, model, device, max_length=200):
    instructions = []

    for doc in preprocessed_documents:
        # Split the document into chunks of at most max_length characters (not tokens)
        chunks = [doc[i:i + max_length] for i in range(0, len(doc), max_length)]

        for chunk in chunks:
            # TODO: The prompt wording strongly affects output quality and still needs improvement
            # Add a prompt for instruction generation
            prompt = f"""Generate a unique question (without generating an answer) based on the provided text:
            "{chunk}"
            Question:"""

            # Tokenize the chunk and prompt for Mistral 7B
            inputs = tokenizer(prompt, return_tensors="pt")
            # Move the input tensors to the correct device
            inputs = {k: v.to(device) for k, v in inputs.items()}

            # Adjust the generation parameters
            outputs = model.generate(
                **inputs,
                max_new_tokens=64,  # Cap the generated tokens only; max_length would also count the prompt tokens
                do_sample=True,  # Enable sampling for more diverse outputs
                top_k=50,       # Sample only from the 50 most likely next tokens
                top_p=0.95,     # Nucleus sampling: keep the smallest token set with cumulative probability >= 0.95
                temperature=0.7 # Values below 1.0 sharpen the distribution (less random output)
            )

            # Decode only the newly generated tokens (everything after the prompt)
            prompt_length = inputs["input_ids"].shape[1]
            output_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
            instructions.append(output_text)

    return instructions
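
The chunking step above slices each document at fixed character offsets, so a chunk boundary can fall in the middle of a word. As a rough sketch of one possible alternative, the standard library's `textwrap.wrap` breaks only at whitespace while keeping each chunk within the same character budget. The helper names `chunk_by_chars` and `chunk_by_words` and the sample text are assumptions made for illustration.

```python
import textwrap

def chunk_by_chars(doc, size=200):
    """Fixed-width slicing, as in generate_instructions; may split a word."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def chunk_by_words(doc, size=200):
    """Hypothetical alternative: break only at whitespace via textwrap.wrap."""
    return textwrap.wrap(doc, width=size)

doc = ("word " * 90).strip()  # made-up text, 449 characters long
char_chunks = chunk_by_chars(doc)
word_chunks = chunk_by_words(doc)

print([len(c) for c in char_chunks])
print([len(c) for c in word_chunks])
```

Either helper could replace the list comprehension inside `generate_instructions`; which one works better depends on how sensitive the prompt is to truncated words.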


### **Create Training Dataset**

In [115]:
def create_training_data(instructions):
    # Format instructions into training data for your LLM
    training_data = []
    for instruction in instructions:
        # Wrap each instruction in a dictionary with a single "output" field
        training_data.append({"output": instruction})
    return training_data
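
Sampled generations are not always clean: a question may come back wrapped in quotation marks, followed by extra lines, or duplicated across chunks. The following is a minimal sketch of a cleanup pass that could run before `create_training_data`. The helpers `clean_instruction` and `dedupe` and the sample strings are hypothetical and only illustrate the idea.

```python
def clean_instruction(text):
    """Keep the first non-empty line and strip wrapping quotes (illustrative helper)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    first = lines[0] if lines else ""
    return first.strip("\"'").strip()

def dedupe(instructions):
    """Clean each item, then drop empty strings and exact duplicates, preserving order."""
    seen, unique = set(), []
    for item in map(clean_instruction, instructions):
        if item and item not in seen:
            seen.add(item)
            unique.append(item)
    return unique

raw = [
    '"What is hypertension?"',              # wrapped in quotes
    "What is hypertension?\nAnswer: ...",   # duplicate with trailing text
    "",                                     # empty generation
    "What is asthma?",
]
cleaned = dedupe(raw)
print(cleaned)
```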


### **Main Function**

In [116]:
def main(dataset, tokenizer, model, device):
    # Preprocess the documents
    preprocessed_documents = preprocess_documents(dataset)

    # Generate Instructions from the preprocessed documents
    instructions = generate_instructions(preprocessed_documents, tokenizer, model, device)

    # Print generated Instructions
    print("Generated Instructions:")
    for instruction in instructions:
        print(instruction)

    # Create instructions data from the generated instructions
    training_data = create_training_data(instructions)

    # Convert training data to DataFrame
    df = pd.DataFrame(training_data)

    # Write DataFrame to CSV file
    df.to_csv("instructions_data.csv", index=False)

    print("Dataset created successfully.")
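
CSV works for a single `output` column, but many fine-tuning tools also accept JSON Lines, which handles multi-line text and extra fields without quoting issues. The sketch below, using only the standard library, writes and reads back records of the same shape that `create_training_data` returns. The helper names, the `.jsonl` file name, and the temporary directory are assumptions for illustration.

```python
import json
import os
import tempfile

def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read the file back into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

records = [
    {"output": "What is the common cold?"},
    {"output": "What is the definition of hypertension?"},
]
path = os.path.join(tempfile.mkdtemp(), "instructions_data.jsonl")
write_jsonl(records, path)
loaded = read_jsonl(path)
print(loaded == records)
```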

In [117]:
# Any text dataset can be used here, depending on your requirements
training_medical_dataset = [
    "The common cold is a viral infection of the nose and throat. Symptoms include a runny or stuffy nose, sore throat, cough, congestion, sneezing, and mild fatigue.",
    "The heart is a muscular organ responsible for pumping blood throughout the body, supplying oxygen and nutrients to the tissues and removing carbon dioxide and other wastes.",
    "Type 2 diabetes is a chronic condition that affects the way your body metabolizes sugar (glucose). It occurs when the body becomes resistant to insulin or doesn't produce enough insulin, leading to high blood sugar levels.",
    "Bacterial infections are caused by bacteria, while viral infections are caused by viruses. Antibiotics can be used to treat bacterial infections, but they are not effective against viral infections, which usually need to run their course or be treated with antiviral medications.",
    "High blood pressure, or hypertension, is a condition in which the force of blood against the artery walls is consistently too high, increasing the risk of heart disease, stroke, and other health problems.",
    "Asthma is a chronic disease that affects the airways (tubes) that carry air in and out of your lungs. When you have asthma, the airways become inflamed and narrowed, making it hard to breathe.",
    "Gastroesophageal reflux disease (GERD) is a chronic condition where the stomach contents flow back up into the esophagus, causing symptoms like heartburn, chest pain, and difficulty swallowing.",
    "Irritable bowel syndrome (IBS) is a common disorder that affects the large intestine. Signs and symptoms include cramping, abdominal pain, bloating, gas, and diarrhea or constipation, or both.",
    "Migraine is a neurological condition characterized by recurrent moderate to severe headaches, often accompanied by nausea, vomiting, and sensitivity to light and sound.",
    "Osteoarthritis is the most common form of arthritis, affecting millions of people worldwide. It occurs when the protective cartilage that cushions the ends of your bones wears down over time, causing pain, stiffness, and loss of joint function."
]


In [118]:
# Call the main function with training_medical_dataset, tokenizer, model, and device
main(training_medical_dataset, tokenizer, model, device)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Generated Instructions:
What is the common cold?
What is the role of the heart in the human body?
What are the potential long-term complications of type 2 diabetes?
"What would happen if a person's blood sugar levels were to suddenly drop to dangerously low levels?"
What is the main difference between bacterial and viral infections and how do they respond to antibiotics?
What are the typical requirements for individuals with a viral infection to recover?
What is the definition of hypertension?
How does asthma affect the airways tubes that carry air in and out of the lungs?
What is the medical term for when the stomach contents flow back up into the esophagus, causing symptoms like heartburn, chest pain and difficulty swallowing?
What is the definition of irritable bowel syndrome (IBS)?
What is the definition of a migraine?
What is the relationship between osteoarthritis and the protective cartilage that cushions the ends of bones?
What is the relationship between joint stiffness and lo