#### Use Bloom 3b for Q&A along with SetFit few-shot examples and examine whether it improves Q&A over the baseline

##### Setup

In [1]:
# Bloom is part of the transformers library --> install it
!pip install --quiet transformers

In [2]:
# import the needed libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM # generic classes for causal language modeling

In [3]:
# test GPU availability - otherwise fall back to cpu

!nvidia-smi
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device


NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



device(type='cpu')

In [4]:
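# if we have a gpu, allocate all new tensors on it; otherwise they default to the cpu
# note: set_default_tensor_type is deprecated in recent PyTorch releases, where torch.set_default_device is the replacement

if 'cuda' in str(device):
  torch.set_default_tensor_type(torch.cuda.FloatTensor) # this will allocate all tensors on cuda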



#### Using bloom 3b for Q&A with a reference text

In [5]:
# define the tokenizer and model

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-3b")

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-3b") # here we use the bloom checkpoint with 3b parameters

In [6]:
# define any text prompt / or a list of prompts / or read from file

# in this case we have a reference paragraph + question

with open('/content/curcumin_research.txt') as f:
  text_prompt = f.read()

In [7]:
# Have a look at the prompt
text_prompt

'The current study investigated the effects of nano-curcumin on the MCF7 cell line. The results indicated that nano-curcumin decreased cell proliferation by 83.6%, which was more than that achieved by cyclophosphamide (63.31%), adriamycin (70.75%), and 5-fluorouracil (75.04%).\n\n\n\nAccrording to the above, Is Curcumin more effective than cyclophosphamide in treating MCF7 cell line?\n\n\n'

In [8]:
# let's look at how bloom tokenizes text
tokens = tokenizer.tokenize(text_prompt)
print(tokens)

['The', 'Ġcurrent', 'Ġstudy', 'Ġinvestigated', 'Ġthe', 'Ġeffects', 'Ġof', 'Ġnano', '-c', 'urc', 'umin', 'Ġon', 'Ġthe', 'ĠM', 'CF', '7', 'Ġcell', 'Ġline', '.', 'ĠThe', 'Ġresults', 'Ġindicated', 'Ġthat', 'Ġnano', '-c', 'urc', 'umin', 'Ġdecreased', 'Ġcell', 'Ġproliferation', 'Ġby', 'Ġ83', '.', '6%', ',', 'Ġwhich', 'Ġwas', 'Ġmore', 'Ġthan', 'Ġthat', 'Ġachieved', 'Ġby', 'Ġcycl', 'op', 'hosph', 'amide', 'Ġ(', '63', '.', '31%', '),', 'Ġad', 'ri', 'amy', 'cin', 'Ġ(', '70', '.', '75%', '),', 'Ġand', 'Ġ5-', 'fluor', 'our', 'ac', 'il', 'Ġ(', '75', '.', '04', '%', ').Ċ', 'ĊĊĊ', 'Ac', 'cr', 'ording', 'Ġto', 'Ġthe', 'Ġabove', ',', 'ĠIs', 'ĠCur', 'c', 'umin', 'Ġmore', 'Ġeffective', 'Ġthan', 'Ġcycl', 'op', 'hosph', 'amide', 'Ġin', 'Ġtreating', 'ĠM', 'CF', '7', 'Ġcell', 'Ġline', '?ĊĊ', 'Ċ']
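The `Ġ` and `Ċ` symbols in the token list are how a byte-level BPE tokenizer renders a leading space and a newline. The sketch below undoes just those two markers to make the tokens readable again. It assumes plain ASCII text, and the helper name `tokens_to_text` is made up for illustration; on the real tokenizer, `tokenizer.convert_tokens_to_string(tokens)` handles the full byte mapping.

```python
def tokens_to_text(tokens):
    """Join byte-level BPE tokens and undo the two most common markers.

    Assumption: the text is plain ASCII, so only the space marker (Ġ)
    and the newline marker (Ċ) need to be mapped back.
    """
    text = "".join(tokens)
    return text.replace("Ġ", " ").replace("Ċ", "\n")


# first tokens of the prompt, copied from the output above
sample = ['The', 'Ġcurrent', 'Ġstudy', 'Ġinvestigated', 'Ġthe', 'Ġeffects',
          'Ġof', 'Ġnano', '-c', 'urc', 'umin']
print(tokens_to_text(sample))
```

Note how rare words such as "curcumin" are split into several sub-word pieces, while common words map to a single token.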


In [9]:
# convert the prompt to input ids and return pytorch tensors
input_ids = tokenizer(text_prompt, return_tensors="pt")
input_ids

{'input_ids': tensor([[  2175,   6644,  12589,  68760,    368,  22886,    461,  98678,   5770,
         140298,  23595,    664,    368,    435,  27267,     26,   9436,   5008,
             17,   1387,   9649,  46196,    861,  98678,   5770, 140298,  23595,
          77567,   9436, 117062,   1331,  27206,     17,  27448,     15,   2131,
           1620,   3172,   4340,    861,  46025,   1331,  35845,    636, 203858,
         156465,    375,   8291,     17, 228382,   1013,    856,    850,  75837,
          28000,    375,   4193,     17, 102124,   1013,    530, 109354, 232985,
            610,    377,    354,    375,   4963,     17,  11969,      8,   2913,
          16783,  10098,   1376,  23454,    427,    368,   9468,     15,   4020,
          34913,     70,  23595,   3172,  24307,   4340,  35845,    636, 203858,
         156465,    361, 130095,    435,  27267,     26,   9436,   5008,   7076,
            189]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [10]:
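# generate tokens as ids and review the generated ids (the output starts with the prompt ids)
# note: temperature only takes effect with do_sample=True; as written, decoding is greedy
gen_text = model.generate(**input_ids, min_length=80, max_length=150, temperature=0.5)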
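In `generate`, `temperature` rescales the next-token logits before sampling, so it only changes the output when sampling is enabled with `do_sample=True`. The sketch below shows the effect of that rescaling on a toy distribution. The logits and the function name `softmax_with_temperature` are invented for illustration and are not taken from the model.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.5, 0.2):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the highest-scoring token, which makes sampled text more conservative; higher temperatures flatten the distribution.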

In [11]:
gen_text

tensor([[  2175,   6644,  12589,  68760,    368,  22886,    461,  98678,   5770,
         140298,  23595,    664,    368,    435,  27267,     26,   9436,   5008,
             17,   1387,   9649,  46196,    861,  98678,   5770, 140298,  23595,
          77567,   9436, 117062,   1331,  27206,     17,  27448,     15,   2131,
           1620,   3172,   4340,    861,  46025,   1331,  35845,    636, 203858,
         156465,    375,   8291,     17, 228382,   1013,    856,    850,  75837,
          28000,    375,   4193,     17, 102124,   1013,    530, 109354, 232985,
            610,    377,    354,    375,   4963,     17,  11969,      8,   2913,
          16783,  10098,   1376,  23454,    427,    368,   9468,     15,   4020,
          34913,     70,  23595,   3172,  24307,   4340,  35845,    636, 203858,
         156465,    361, 130095,    435,  27267,     26,   9436,   5008,   7076,
            189,   2175,   9649,    461,    368,   3344,  12589,  30638,    861,
          98678,   5770, 140

In [12]:
# decode the generated ids back into text

predicted_text = tokenizer.decode(gen_text[0])

In [13]:
# and display the decoded text
predicted_text

'The current study investigated the effects of nano-curcumin on the MCF7 cell line. The results indicated that nano-curcumin decreased cell proliferation by 83.6%, which was more than that achieved by cyclophosphamide (63.31%), adriamycin (70.75%), and 5-fluorouracil (75.04%).\n\n\n\nAccrording to the above, Is Curcumin more effective than cyclophosphamide in treating MCF7 cell line?\n\n\nThe results of the present study showed that nano-curcumin inhibited the expression of MMP-2 and MMP-9, which are the key enzymes in the degradation of extracellular matrix. The results of the present study also showed that nano-curcumin inhibited'

In [14]:
# show only the answer (everything after the last question mark)
answer = predicted_text[predicted_text.rfind("?")+1 :]
answer = answer.replace("\n", "") # remove newlines

answer

'The results of the present study showed that nano-curcumin inhibited the expression of MMP-2 and MMP-9, which are the key enzymes in the degradation of extracellular matrix. The results of the present study also showed that nano-curcumin inhibited'
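Splitting on the last `?` works here, but it would cut the answer short if the model itself generated a question mark. A possible alternative, sketched below, is to strip the known prompt from the front of the decoded text and keep the `rfind` approach only as a fallback. The function name `extract_answer` and the sample strings are assumptions made for illustration.

```python
def extract_answer(generated_text, prompt):
    """Return the generated continuation without the prompt.

    Falls back to "everything after the last question mark" when the
    decoded text does not start with the prompt.
    """
    if generated_text.startswith(prompt):
        answer = generated_text[len(prompt):]
    else:
        answer = generated_text[generated_text.rfind("?") + 1:]
    # collapse newlines and repeated spaces into single spaces
    return " ".join(answer.split())


# hypothetical strings, not model output
demo_prompt = "According to the above, is A more effective than B?\n\n\n"
demo_generated = demo_prompt + "Yes. But is the effect large?\nThe study does not say."
print(extract_answer(demo_generated, demo_prompt))
```

Joining on whitespace rather than deleting `\n` also avoids gluing two sentences together when a newline is the only separator between them.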

In [15]:
# do some cleaning of text:
# 1. find the last "." and delete everything afterwards to get a clean sentence
# 2. remove the newline character \n to make the text easier to review
# 3. remove <pad> tokens

# for item, pred in enumerate([predicted_text]): # wrap the single string in a list
#   pred = pred[0:pred.rfind(".")+1] # truncate words beyond last period
#   pred = pred.replace("\n", "") # remove newlines
#   pred = pred.replace("<pad>", "") # remove padding tokens
#   print(f"Paragraph {item}: {pred}\n")

#### Using bloom 3b parameters for zero-shot Q&A

In [None]:
# define the tokenizer and model

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-3b")

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-3b") # here we use the bloom checkpoint with 3b parameters

In [None]:
# several general knowledge questions

prompt = ["Who was Otto Warburg and what he is known for?",
          "What Prof. Thomas Seyfried theory about cancer as a metabolic disease says?",
          "What is the role of the m-Tor pathway in cancer progression?",
          "What are the best natural blood Glucose inhibitors?",
          "Is the theory of cancer as a metabolic disease correct or not?"]

In [None]:
# convert it into input_ids
input_ids = tokenizer(prompt, return_tensors='pt', padding=True)
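With `padding=True`, shorter prompts are padded to the length of the longest one, and an attention mask marks which positions are real tokens. The `<pad>` tokens at the start of the decoded outputs in this notebook show that the padding goes on the left, which lets each generated continuation follow directly after its prompt. The sketch below reproduces that layout in plain Python; the function name `left_pad`, the token ids, and `pad_id=0` are placeholders, not the tokenizer's actual values.

```python
def left_pad(batch, pad_id):
    """Left-pad lists of token ids to a common length and build attention masks."""
    width = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))  # 0 = padding, 1 = real token
    return input_ids, attention_mask


# hypothetical token ids for three prompts of different lengths
toy_batch = [[11, 12, 13, 14], [21, 22], [31, 32, 33]]
ids, mask = left_pad(toy_batch, pad_id=0)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```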

In [None]:
# generate predicted ids that should now contain the answers
# note: temperature only takes effect with do_sample=True; as written, decoding is greedy
generated_ids = model.generate(**input_ids, min_length=20, max_length=80, temperature=0.2, repetition_penalty=1.1)

In [None]:
# transform it back to text

predictions = []
for ids in generated_ids:
  predicted_text = tokenizer.decode(ids)
  print(f"Q&A: {predicted_text}\n")
  predictions.append(predicted_text)

Q&A: <pad><pad><pad>Who was Otto Warburg and what he is known for? The German scientist who discovered the significance of glucose metabolism in cancer cells. He also developed a method to measure oxygen consumption by living organisms.
The first thing you should know about Otto Warburg is that he was born on July 25, 1875 in Vienna, Austria. His father was a physician and his mother was an artist. He

Q&A: What Prof. Thomas Seyfried theory about cancer as a metabolic disease says? What is the difference between metabolism and energy production?
The most important thing to know about metabolism is that it is an essential process in all living organisms, including humans.
Metabolism is the chemical reactions that take place inside cells to produce energy from food or other sources of fuel. Metabolism also helps us store nutrients for

Q&A: <pad>What is the role of the m-Tor pathway in cancer progression? The mTOR pathway has been implicated in a variety of cellular processes, including 

In [None]:
# do some cleaning of text:
# 1. find the last "." and delete everything afterwards to get a clean sentence
# 2. remove the newline character \n to make the text easier to review
# 3. remove <pad> tokens

for item, pred in enumerate(predictions):
  pred = pred[0:pred.rfind(".")+1] # truncate words beyond last period
  pred = pred.replace("\n", "") # remove newlines
  pred = pred.replace("<pad>", "") # remove padding tokens
  print(f"Paragraph {item}: {pred}\n")

Paragraph 0: Who was Otto Warburg and what he is known for? The German scientist who discovered the significance of glucose metabolism in cancer cells. He also developed a method to measure oxygen consumption by living organisms.The first thing you should know about Otto Warburg is that he was born on July 25, 1875 in Vienna, Austria. His father was a physician and his mother was an artist.

Paragraph 1: What Prof. Thomas Seyfried theory about cancer as a metabolic disease says? What is the difference between metabolism and energy production?The most important thing to know about metabolism is that it is an essential process in all living organisms, including humans.Metabolism is the chemical reactions that take place inside cells to produce energy from food or other sources of fuel.

Paragraph 2: What is the role of the m-Tor pathway in cancer progression? The mTOR pathway has been implicated in a variety of cellular processes, including cell growth and proliferation.

Paragraph 3: Wh
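The three cleaning steps can also be wrapped in one reusable function. The sketch below is one possible variant, not the notebook's exact logic: it replaces newlines with a space so that sentences are not glued together (as in "organisms.The" above), and it leaves the text untouched when no period is found, whereas slicing with `rfind(".") + 1` would return an empty string in that case. The name `clean_prediction` and the sample string are made up for illustration.

```python
def clean_prediction(text, pad_token="<pad>"):
    """Remove padding tokens, normalize whitespace, and cut after the last period."""
    text = text.replace(pad_token, "")
    text = " ".join(text.split())  # newlines and repeated spaces become one space
    last_period = text.rfind(".")
    if last_period != -1:  # keep the text as-is when it contains no period
        text = text[: last_period + 1]
    return text


# hypothetical raw output, not produced by the model
raw = "<pad><pad>What is mTOR? It is a kinase.\nIt regulates growth. It also"
print(clean_prediction(raw))
```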

#### Using bloom 3b for general Q&A with a question-answering model and few-shot examples

## Note: This does not work - the model is not yet trained for Q&A. In a separate notebook we will train it for Q&A

In [None]:
# define the tokenizer and model, this time using a model class specific to question answering

from transformers import AutoModelForQuestionAnswering, AutoModelForDocumentQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-3b")

model = AutoModelForQuestionAnswering.from_pretrained("bigscience/bloom-3b") # as we can see this does not work - it is not yet trained for Q&A

#### As we can see, this does not work: the model is not yet trained for Q&A. In a separate notebook we will train it for Q&A
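Until a fine-tuned model is available, few-shot examples can still be tried with the causal model by placing solved question/answer pairs in the prompt itself. This is plain in-context prompting, not SetFit, and the sketch below only builds the prompt string. The function name `build_few_shot_prompt`, the `Question:`/`Answer:` layout, and the example pairs are all assumptions chosen for illustration.

```python
def build_few_shot_prompt(examples, question, instruction=""):
    """Assemble a prompt from solved examples followed by the open question."""
    parts = []
    if instruction:
        parts.append(instruction.strip())
    for example_question, example_answer in examples:
        parts.append(f"Question: {example_question}\nAnswer: {example_answer}")
    # the final block ends with "Answer:" so the model continues from there
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


# hypothetical worked examples
few_shot_examples = [
    ("What is the capital of France?", "Paris."),
    ("How many legs does a spider have?", "Eight."),
]
few_shot_prompt = build_few_shot_prompt(
    few_shot_examples,
    "What is the role of the m-Tor pathway in cancer progression?",
)
print(few_shot_prompt)
```

The resulting string can be passed to the tokenizer in place of a bare question; whether it improves the answers over the zero-shot baseline would still need to be checked.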