## Medical Assistant Bot Assignment

Installing the required libraries

In [1]:
!pip install -U bitsandbytes
!pip install transformers==4.36.2
!pip install -U peft
!pip install -U accelerate
!pip install -U trl
!pip install datasets==2.16.0
!pip install sentencepiece
!pip install sentence-transformers



### Mounting drive to get the dataset

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import pandas as pd

In [4]:
df = pd.read_csv("/content/drive/MyDrive/intern_screening_dataset.csv")
df.head()

Unnamed: 0,question,answer
0,What is (are) Glaucoma ?,Glaucoma is a group of diseases that can damag...
1,What is (are) Glaucoma ?,The optic nerve is a bundle of more than 1 mil...
2,What is (are) Glaucoma ?,Open-angle glaucoma is the most common form of...
3,Who is at risk for Glaucoma? ?,Anyone can develop glaucoma. Some people are a...
4,How to prevent Glaucoma ?,"At this time, we do not know how to prevent gl..."


In [5]:
df.shape

(16406, 2)

In [6]:
df['question'].value_counts()

question
What causes Causes of Diabetes ?                                         20
What is (are) High Blood Cholesterol ?                                   19
What is (are) Medicare and Continuing Care ?                             14
What is (are) Skin Cancer ?                                              13
What is (are) Stroke ?                                                   13
                                                                         ..
What are the symptoms of Metachondromatosis ?                             1
What is (are) Disseminated peritoneal leiomyomatosis ?                    1
What are the symptoms of Disseminated peritoneal leiomyomatosis ?         1
What causes Disseminated peritoneal leiomyomatosis ?                      1
How to diagnose Diabetic Neuropathies: The Nerve Damage of Diabetes ?     1
Name: count, Length: 14981, dtype: int64

### Initialising base model for finetuning

In [7]:
base_model = "mistralai/Mistral-7B-Instruct-v0.2"
new_model = "mistral7B_medical"

### Importing all the libraries

In [8]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,HfArgumentParser,TrainingArguments,pipeline, logging
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os,torch
from datasets import load_dataset
from trl import SFTTrainer
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
from datasets import Dataset
import re



In [9]:
from google.colab import userdata
secret_hf = userdata.get('HF_TOKEN')
!huggingface-cli login --token $secret_hf

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Prompt format for Mistral (there is no system message)

```
<s>[INST] user prompt [/INST]
assistant response </s>
```
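
The template can be captured in a pair of small helpers so that the training text and the inference prompt are always built the same way. This is only a minimal sketch: the names `format_prompt` and `format_example` are hypothetical (they do not come from any library), and the spacing follows the strings this notebook builds in its own cells, with no spaces around the tags.

```python
def format_prompt(question: str) -> str:
    """Wrap a user question in Mistral's instruction tags (no system message)."""
    return f"<s>[INST]{question}[/INST]"


def format_example(question: str, answer: str) -> str:
    """Build one full training string: prompt, answer, end-of-sequence token."""
    return f"{format_prompt(question)}{answer}</s>"


# Illustrative question/answer pair (shortened, not a real dataset row).
example = format_example("What is (are) Glaucoma ?", "Glaucoma is a group of diseases.")
print(example)
```

Sharing one helper between training and inference reduces the risk that the two formats drift apart.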

In [10]:
df.columns

Index(['question', 'answer'], dtype='object')

### Removing null values

In [11]:
df.isnull().sum()

question    0
answer      5
dtype: int64

In [12]:
df = df.dropna()

### Splitting the dataset

In [13]:
# train test split
from sklearn.model_selection import train_test_split
X = df['question']
y = df['answer']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
type(X_train)

In [15]:
X_train = X_train.tolist()
y_train = y_train.tolist()
X_test = X_test.tolist()
y_test = y_test.tolist()
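
The `value_counts()` output above shows that some questions occur many times, so a row-level random split may put the same question in both the training and the test set. One possible alternative, which this notebook does not use, is to split on unique questions instead. The sketch below uses only the standard library and toy data; `split_by_question` is a hypothetical helper name, and scikit-learn's `GroupShuffleSplit` could probably serve the same purpose.

```python
import random


def split_by_question(questions, answers, test_size=0.2, seed=42):
    """Split rows so that every copy of a question lands on the same side."""
    unique_questions = sorted(set(questions))
    rng = random.Random(seed)
    rng.shuffle(unique_questions)

    # Reserve a fraction of the *unique* questions for the test set.
    n_test = max(1, int(len(unique_questions) * test_size))
    test_questions = set(unique_questions[:n_test])

    train, test = [], []
    for q, a in zip(questions, answers):
        (test if q in test_questions else train).append((q, a))
    return train, test


# Toy data: the first question appears twice, as happens in the real dataset.
toy_q = ["What is A ?", "What is A ?", "What is B ?", "What is C ?", "What is D ?"]
toy_a = ["a1", "a2", "b", "c", "d"]
train_rows, test_rows = split_by_question(toy_q, toy_a)
print(len(train_rows), len(test_rows))
```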

### Transforming the dataset into the required template

In [16]:
text = []
for i in range(len(X_train)):
  text.append('<s>[INST]' + X_train[i] +'[/INST]'+ y_train[i] + '</s>')

df = pd.DataFrame({'text':text})

In [17]:
df.head()

Unnamed: 0,text
0,<s>[INST]What is (are) Apert syndrome ?[/INST]...
1,<s>[INST]Who is at risk for Osteoporosis? ?[/I...
2,<s>[INST]what research (or clinical trials) is...
3,<s>[INST]What is (are) dihydropyrimidinase def...
4,<s>[INST]What are the symptoms of Brachydactyl...


In [18]:
# Convert the DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df)

In [19]:
# Loading base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.bfloat16,
    bnb_4bit_use_double_quant= False,
)
model = AutoModelForCausalLM.from_pretrained(
        base_model,
        # 4-bit loading is already configured in bnb_config, so a separate load_in_4bit flag is redundant
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
)

model.config.use_cache = False
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

# Loading tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.bos_token, tokenizer.eos_token




('<s>', '</s>')

In [20]:
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
)
model = get_peft_model(model, peft_config)
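
To get a feel for what this LoRA configuration adds, the number of trainable adapter parameters can be estimated by hand: each adapted linear layer gains two matrices of shapes `r x in_features` and `out_features x r`. The sketch below is a back-of-the-envelope calculation, not output from this notebook. The layer shapes are an assumption taken from the published Mistral-7B configuration (hidden size 4096, 8 key/value heads of dimension 128, MLP intermediate size 14336, 32 decoder layers).

```python
# Rank and alpha from the LoraConfig above.
r = 64
lora_alpha = 16

# (in_features, out_features) for each target module.
# ASSUMPTION: shapes taken from the published Mistral-7B configuration.
target_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
}
num_layers = 32

# Each adapted module adds A (r x in) and B (out x r), i.e. r * (in + out) weights.
params_per_layer = sum(r * (d_in + d_out) for d_in, d_out in target_shapes.values())
total_lora_params = params_per_layer * num_layers

# LoRA scales the low-rank update by alpha / r.
scaling = lora_alpha / r

print(f"LoRA parameters: {total_lora_params:,}")
print(f"Scaling factor: {scaling}")
```

On the real model, PEFT's `model.print_trainable_parameters()` reports the actual count and is the better source of truth.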

In [21]:
# Training hyperparameters
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=10,
    logging_steps=1,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    max_steps=100,  # takes precedence over num_train_epochs, so training stops after 100 steps
    group_by_length=True,
    lr_scheduler_type="constant")


In [22]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length= 256,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)

Map:   0%|          | 0/13120 [00:00<?, ? examples/s]



In [23]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,2.7837
2,2.5007
3,1.9697
4,2.018
5,1.6607
6,1.5597
7,1.4957
8,1.2695
9,1.2623
10,1.2788




TrainOutput(global_step=100, training_loss=1.2014599841833116, metrics={'train_runtime': 1170.8387, 'train_samples_per_second': 0.342, 'train_steps_per_second': 0.085, 'total_flos': 3365787365867520.0, 'train_loss': 1.2014599841833116, 'epoch': 0.03})
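
The reported `'epoch': 0.03` follows from the step budget: with `max_steps=100` the run sees only a small fraction of the 13120 training examples. The arithmetic below is a sketch that assumes a single GPU (the usual Colab setup); the other numbers are copied from the cells above.

```python
max_steps = 100
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_devices = 1             # ASSUMPTION: a single Colab GPU
num_train_examples = 13120  # from the "Map" progress line printed by SFTTrainer

# Examples consumed = optimizer steps x effective batch size.
examples_seen = (
    max_steps
    * per_device_train_batch_size
    * gradient_accumulation_steps
    * num_devices
)
epochs_completed = examples_seen / num_train_examples

print(f"{examples_seen} examples seen, {epochs_completed:.2f} epochs")
```

In other words, the model was fine-tuned on roughly 400 examples, which is worth keeping in mind when reading the evaluation.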

In [24]:
# Save the LoRA adapter, re-enable the KV cache, and switch to eval mode (disables dropout) before evaluating
trainer.model.save_pretrained(new_model)
model.config.use_cache = True
model.eval()



PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Line

### Function to make prompts from question

In [25]:
def build_prompt(question):
  prompt=f"<s>[INST]{question}[/INST]"
  return prompt

### Building the pipeline

In [26]:
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=170)

The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonF

### Testing with examples

### Common questions with multiple instances in the dataset

In [27]:
question = "what is Glaucoma ?"
prompt = build_prompt(question)
result = pipe(prompt)

print(result[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s>[INST]what is Glaucoma ?[/INST]Glaucoma is a group of eye conditions that damage the optic nerve and cause vision loss. The most common type of glaucoma is primary open-angle glaucoma, in which the fluid in the eye does not flow properly. The condition is usually painless and causes no symptoms until vision loss is well underway. The condition is diagnosed with a painless test, and treatment can slow or stop the progression of the disease. Other types of glaucoma include closed-angle glaucoma, in which the fluid in the eye is unable to flow through the pupil, and secondary glaucoma, in which the condition is a complication of another disease or condition. Treatment for glaucoma includes medications


In [28]:
question = "What is (are) Medicare and Continuing Care ?"
prompt = build_prompt(question)
result = pipe(prompt)

print(result[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s>[INST]What is (are) Medicare and Continuing Care ?[/INST]Medicare and continuing care is a program that provides health care coverage for people who are 65 or older and have a chronic illness or disability. The program is designed to help people who need long-term care, which is care that is needed for a long time. Medicare and continuing care is a program that is offered by some states. It is not available in all states. If you are interested in this program, you should contact your state's Medicaid office to find out if it is available in your state. You can find your state's Medicaid office by visiting the Medicaid.gov website. Medicare and continuing care is a program that is designed to help people who need long-term care


In [29]:
question = "What is (are) Skin Cancer ?"
prompt = build_prompt(question)
result = pipe(prompt)

print(result[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s>[INST]What is (are) Skin Cancer ?[/INST]Skin cancer is the most common form of cancer. It is usually caused by exposure to ultraviolet (UV) radiation from the sun. The three main types of skin cancer are basal cell carcinoma, squamous cell carcinoma, and melanoma. Basal cell carcinoma is the most common type of skin cancer. It is usually found on the face, neck, or hands. Squamous cell carcinoma is the second most common type of skin cancer. It is usually found on the face, ears, or hands. Melanoma is the most dangerous type of skin cancer. It is usually found on the back, legs, or arms. It is important to find and treat skin cancer early. Treatment


In [30]:
question = "What are the treatments for Hearing Loss ?"
prompt = build_prompt(question)
result = pipe(prompt)

print(result[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s>[INST]What are the treatments for Hearing Loss ?[/INST]Treatment for hearing loss depends on the cause and the type of hearing loss. For example, if the hearing loss is due to a condition that can be treated, such as an ear infection, the hearing loss may improve once the condition is treated. If the hearing loss is due to a condition that cannot be treated, such as aging, a hearing aid may be recommended. In some cases, a cochlear implant may be recommended. A cochlear implant is a surgical procedure that involves placing an electronic device in the ear. The device is designed to help the person hear. A cochlear implant is recommended for people who have a severe hearing loss and who do not benefit from a hearing aid. A cochlear implant


### Questions that are rarest in the dataset

In [31]:
question = "What causes Disseminated peritoneal leiomyomatosis ?"
prompt = build_prompt(question)
result = pipe(prompt)

print(result[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s>[INST]What causes Disseminated peritoneal leiomyomatosis ?[/INST]The cause of disseminated peritoneal leiomyomatosis is unknown. It is thought to be related to hormonal factors. The condition is more common in women than in men. It is also more common in women who have had multiple pregnancies. The condition is not related to the use of hormonal contraceptives. The condition is not related to the use of hormonal contraceptives. The condition is not related to the use of hormonal contraceptives. The condition is not related to the use of hormonal contraceptives. The condition is not related to the use of hormonal contraceptives. The condition is not related to the use of horm


In [32]:
question = "What are the symptoms of Disseminated peritoneal leiomyomatosis ?"
prompt = build_prompt(question)
result = pipe(prompt)

print(result[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s>[INST]What are the symptoms of Disseminated peritoneal leiomyomatosis ?[/INST]Signs and symptoms of disseminated peritoneal leiomyomatosis include abdominal pain, abdominal swelling, and constipation. These symptoms may be caused by disseminated peritoneal leiomyomatosis or by other conditions. Check with your doctor if you have any of the following:         - Abdominal pain    - Abdominal swelling    - Constipation    NIH: National Institute of Child Health and Human Development    NIH: National Cancer Institute    NIH: National Institute of Diabetes and Digestive and Kidney Diseases    NIH: National Institute of Neurological


In [33]:
question = "How to diagnose Diabetic Neuropathies: The Nerve Damage of Diabetes ?"
prompt = build_prompt(question)
result = pipe(prompt)

print(result[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s>[INST]How to diagnose Diabetic Neuropathies: The Nerve Damage of Diabetes ?[/INST]How is diabetic neuropathy diagnosed? Diabetic neuropathy is diagnosed based on a person's symptoms and a physical exam. A doctor may also order tests to rule out other conditions that can cause nerve damage. These tests include a nerve conduction study, which measures the electrical activity of nerves, and a blood test to measure the levels of a protein called islet cell antibody. Islet cell antibodies are produced by the body's immune system and can be a sign of type 1 diabetes. A doctor may also order a blood test to measure the levels of a protein called glycosylated hemoglobin. Glycosyl


### The model gives plausible answers for both common and rare questions, although every response is cut off at `max_length=170` and the answer in cell 31 falls into a repetition loop.

### Function to extract answer from the response of the model

In [34]:
import re

# Example input
input_text = "<s>[INST]what is Glaucoma ?[/INST]Glaucoma is a group of diseases that damage the optic nerve.</s>"

# Define the regex pattern to extract the answer
pattern1 = r'\[\/INST\](.*?)<\/s>'
pattern2 = r'<s>\[INST\].*?\[/INST\](.*)'

# Use re.findall to extract the answer
matches1 = re.findall(pattern1, input_text)
matches2 = re.findall(pattern2, input_text)


# Check if any matches are found and extract the first one (assuming only one answer per prompt)
if matches1:
    answer1 = matches1[0].strip()
else:
    answer1 = None

if matches2:
    answer2 = matches2[0].strip()
else:
    answer2 = None

# Print the extracted answer
print(f"Extracted Answer1: {answer1}")
print(f"Extracted Answer2: {answer2}")

Extracted Answer1: Glaucoma is a group of diseases that damage the optic nerve.
Extracted Answer2: Glaucoma is a group of diseases that damage the optic nerve.</s>


In [35]:
input_text = "<s>[INST]what is Glaucoma ?[/INST]Glaucoma is a group of diseases that damage the optic nerve."

# Define the regex pattern to extract the answer
pattern1 = r'\[\/INST\](.*?)<\/s>'
pattern2 = r'<s>\[INST\].*?\[/INST\](.*)'

# Use re.findall to extract the answer
matches1 = re.findall(pattern1, input_text)
matches2 = re.findall(pattern2, input_text)


# Check if any matches are found and extract the first one (assuming only one answer per prompt)
if matches1:
    answer1 = matches1[0].strip()
else:
    answer1 = None

if matches2:
    answer2 = matches2[0].strip()
else:
    answer2 = None

# Print the extracted answer
print(f"Extracted Answer1: {answer1}")
print(f"Extracted Answer2: {answer2}")

Extracted Answer1: None
Extracted Answer2: Glaucoma is a group of diseases that damage the optic nerve.


In [36]:
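def extract_answer(generated_text):
  # Capture everything after the closing [/INST] tag
  pattern = r'<s>\[INST\].*?\[/INST\](.*)'
  # Search the function argument, not the global input_text from the cells above
  matches = re.findall(pattern, generated_text)

  if matches:
    answer = matches[0].strip()
  else:
    answer = None

  return answer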
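
Cells 34 and 35 show the trade-off between the two patterns: the first returns `None` when the closing `</s>` is missing (as it is for truncated generations), while the second keeps a trailing `</s>` when it is present. A possible variant that handles both cases, and also answers that span several lines, is sketched below. The name `extract_answer_robust` is hypothetical and the test strings are made up for illustration.

```python
import re

# Capture everything after the closing [/INST] tag; DOTALL lets "." match newlines.
ANSWER_PATTERN = re.compile(r"\[/INST\](.*)", re.DOTALL)


def extract_answer_robust(generated_text):
    """Return the text after [/INST], without a trailing </s>, or None if no tag."""
    match = ANSWER_PATTERN.search(generated_text)
    if match is None:
        return None
    answer = match.group(1).strip()
    # Drop the end-of-sequence marker when the generation finished normally.
    if answer.endswith("</s>"):
        answer = answer[: -len("</s>")].strip()
    return answer


# Made-up examples: one complete generation, one cut off by max_length.
complete = "<s>[INST]what is Glaucoma ?[/INST]Line one.\nLine two.</s>"
truncated = "<s>[INST]what is Glaucoma ?[/INST]Line one.\nLine"

print(repr(extract_answer_robust(complete)))
print(repr(extract_answer_robust(truncated)))
```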

### Evaluation

### Since I need to run many inferences, I checked the documentation and found that passing a dataset to the pipeline is the recommended approach, rather than iterating manually with a loop. Even so, inference turned out to be very slow.

Documentation I followed: [This](https://huggingface.co/docs/transformers/en/pipeline_tutorial#using-pipelines-on-a-dataset) and [this](https://huggingface.co/docs/datasets/en/loading#python-dictionary)

In [37]:
from datasets import Dataset

my_dict = {"question": X_test}

dataset = Dataset.from_dict(my_dict)

In [38]:
from transformers.pipelines.pt_utils import KeyDataset

In [39]:
KeyDataset(dataset, "question")

<transformers.pipelines.pt_utils.KeyDataset at 0x7da54d05c7f0>

### Checking the answer format, since the initial run gave `None` as the extracted answer

In [41]:
model_responses = []

for out in pipe(KeyDataset(dataset, "question")):
    print(out)
    break


    # answer = extract_answer(out[0]['generated_text'])
    # model_responses.append(answer)
    # print(answer)
    # break

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[{'generated_text': 'What is (are) harlequin ichthyosis ?\n # \n Harlequin ichthyosis is a rare, inherited condition that is characterized by the presence of a thick, scaly, and darkly pigmented skin. The condition is caused by mutations in the ATP2A2 gene and is inherited in an autosomal dominant pattern.  The ATP2A2 gene provides instructions for making a protein that is a type of ATP (adenosine triphosphate) pump. ATP pumps use energy from ATP to move ions, such as sodium and chloride, in and out of cells. The ATP2A2 protein is found in the outer layer of the skin, which is called the epidermis. The ATP2A2 protein is'}]


In [42]:
out[0]['generated_text']

'What is (are) harlequin ichthyosis ?\n # \n Harlequin ichthyosis is a rare, inherited condition that is characterized by the presence of a thick, scaly, and darkly pigmented skin. The condition is caused by mutations in the ATP2A2 gene and is inherited in an autosomal dominant pattern.  The ATP2A2 gene provides instructions for making a protein that is a type of ATP (adenosine triphosphate) pump. ATP pumps use energy from ATP to move ions, such as sodium and chloride, in and out of cells. The ATP2A2 protein is found in the outer layer of the skin, which is called the epidermis. The ATP2A2 protein is'

### We need a new function to extract answers, since the output format is different: the questions were passed to the pipeline without the `[INST]` template

In [45]:
def extract_answer2(response):
    # Define the regular expression pattern
    pattern = r'\n # \n(.*)'

    # Use re.search to extract the answer
    match = re.search(pattern, response, re.DOTALL)

    if match:
        return match.group(1).strip()  # Return the captured group, stripped of any leading/trailing whitespace
    else:
        return None

In [47]:
model_responses = []

for out in pipe(KeyDataset(dataset, "question")):
    answer = extract_answer2(out[0]['generated_text'])
    model_responses.append(answer)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

KeyboardInterrupt: 

In [48]:
model_responses

['Harlequin ichthyosis is a rare, inherited condition that is characterized by the presence of a thick, scaly, and darkly pigmented skin. The condition is caused by mutations in the ATP2A2 gene and is inherited in an autosomal dominant pattern.  The ATP2A2 gene provides instructions for making a protein that is a type of ATP (adenosine triphosphate) pump. ATP pumps use energy from ATP to move ions, such as sodium and chloride, in and out of cells. The ATP2A2 protein is found in the outer layer of the skin, which is called the epidermis. The ATP2A2 protein is',
 None,
 'Vitelliform macular dystrophy is a rare, inherited, macular dystrophy that is characterized by the presence of a yellow, "egg-yolk"-like lesion in the macula. The macula is the part of the retina that is responsible for central vision. The yellow lesion is a lipofuscin-containing RPE (retinal pigment epithelium) \n # \n The RPE is a layer of cells that is located between the retina and the choroid. The RPE is responsible

### Since I could not reliably extract the answers from the responses, I decided to use the whole response as the answer, on the assumption that this would not change the sentence embedding much.

### Also, since inference takes a long time and I am running out of resources on Colab, I will calculate the score on a much smaller test set of 50 questions
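
Using the whole response means the echoed prompt is embedded together with the answer. Because the generated text begins with the prompt itself, one simple way to isolate the answer is to strip that prefix. The helper below is a hypothetical sketch (`strip_prompt` is not a library function) with a made-up example string.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Drop the echoed prompt from the start of a generation, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text when the prompt is not echoed verbatim.
    return generated_text.strip()


# Made-up example in the same shape as the pipeline outputs shown earlier.
prompt = "<s>[INST]What is (are) Skin Cancer ?[/INST]"
generated = prompt + "Skin cancer is the most common form of cancer."
answer_only = strip_prompt(generated, prompt)
print(answer_only)
```

The text-generation pipeline also accepts `return_full_text=False`, which should return only the newly generated text and may make this step unnecessary.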

In [75]:
X_test_new = [build_prompt(que) for que in X_test]

In [76]:
from datasets import Dataset

my_dict = {"question": X_test_new[:50]}

new_dataset = Dataset.from_dict(my_dict)

In [77]:
model_responses = []

for out in pipe(KeyDataset(new_dataset, "question")):
    model_responses.append(out[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

In [78]:
from sentence_transformers import SentenceTransformer, util
import numpy as np

# Load the pre-trained sentence-embedding model under its own name so that it
# does not overwrite the fine-tuned `model`
embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

scores = []

for i in range(len(model_responses)):
  sentence1 = model_responses[i]
  sentence2 = y_test[i]

  # A missing response gets a score of 0
  if sentence1 is None:
    scores.append(0)
    continue

  # Convert sentences to embeddings
  embedding1 = embedder.encode(sentence1, convert_to_tensor=True)
  embedding2 = embedder.encode(sentence2, convert_to_tensor=True)

  # Compute cosine similarity
  cosine_sim = util.pytorch_cos_sim(embedding1, embedding2)
  scores.append(cosine_sim.item())


print(f"Average Score: {np.mean(scores)}")



Average Score: 0.6984991261363029
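
For reference, the cosine similarity averaged above is the dot product of two embedding vectors divided by the product of their lengths, so it ranges from -1 to 1. The sketch below computes it in plain Python on toy vectors (made up for illustration, not real sentence embeddings).

```python
import math


def cosine_similarity(u, v):
    """Return dot(u, v) / (|u| * |v|) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors: same direction, orthogonal, and somewhere in between.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
in_between = cosine_similarity([1.0, 0.0], [1.0, 1.0])

# Averaging per-pair scores mirrors the "Average Score" computed above.
mean_score = (same_direction + orthogonal + in_between) / 3
print(same_direction, orthogonal, in_between, mean_score)
```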


### Final score (mean cosine similarity over 50 test questions): 0.70

In [81]:
scores

[0.7351052761077881,
 0.8010175228118896,
 0.818317174911499,
 0.6687677502632141,
 0.8105291128158569,
 0.8071779012680054,
 0.7652386426925659,
 0.7760868072509766,
 0.7625764608383179,
 0.4735044538974762,
 0.317290723323822,
 0.7191007137298584,
 0.37834039330482483,
 0.7649561166763306,
 0.6071211099624634,
 0.77991783618927,
 0.7594385147094727,
 0.7186952829360962,
 0.7801324129104614,
 0.6837915182113647,
 0.6886070966720581,
 0.7476788759231567,
 0.7870322465896606,
 0.7819660305976868,
 0.7140623331069946,
 0.702784538269043,
 0.3751709461212158,
 0.8021848201751709,
 0.7922836542129517,
 0.8536555171012878,
 0.8368285894393921,
 0.39310622215270996,
 0.798186182975769,
 0.8057005405426025,
 0.7587932348251343,
 0.1905509978532791,
 0.6533782482147217,
 0.7970296740531921,
 0.8751619458198547,
 0.7510762214660645,
 0.46137019991874695,
 0.6947442293167114,
 0.7338381409645081,
 0.5416038036346436,
 0.8311755657196045,
 0.777381420135498,
 0.6635701656341553,
 0.76886552572250

In [79]:
model_responses

['<s>[INST]What is (are) harlequin ichthyosis ?[/INST]Harlequin ichthyosis is a rare, severe form of ichthyosis, a group of disorders characterized by dry, scaly skin. The condition is caused by mutations in the ABCA12 gene. The ABCA12 gene provides instructions for making a protein that is found in the outer layer of cells, which is called the epidermis. This protein is involved in the transport of cholesterol and other fats from the cell to the outer layer of the skin. Mutations in the ABCA12 gene prevent the normal function of this protein, which is essential for the normal development of the skin. The signs and symptoms of harlequin ichthyosis include thick, scaly skin that covers the entire',
 '<s>[INST]What is (are) Chilaiditi syndrome ?[/INST]Chilaiditi syndrome is a condition that affects the large intestine. The main symptom is a feeling of fullness in the abdomen, which is caused by the intestine being compressed by the sacrospinal ligament. The ligament is a band of tissue t