#### Basic usage of the Bloom model for text prediction - zero-shot

##### Setup

In [None]:
# BLOOM is available through the transformers library, so install it first
!pip install --quiet transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.1-py3-none-any.whl (4.9 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.22.1


In [None]:
# import the required libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM  # generic loader for causal language models


In [None]:
# test GPU availability; fall back to the CPU otherwise

!nvidia-smi
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device


Sun Sep 25 12:18:42 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

device(type='cuda', index=0)

In [None]:
# if a GPU is available, create new tensors on it by default; otherwise stay on the CPU

if device.type == 'cuda':
  torch.set_default_tensor_type(torch.cuda.FloatTensor)  # newly created tensors are allocated on CUDA



#### Using BLOOM 3B for text prediction

In [None]:
# load the tokenizer and the model

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-3b")

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-3b")  # the 3-billion-parameter BLOOM checkpoint

In [None]:
# define a text prompt or a list of prompts
# this example mixes a text-continuation prompt with a general-knowledge question
text_prompt = ["Albert Einstein won a Nobel prize for", "Who was Otto Warburg?"]  # change these to try your own prompts

In [None]:
# let's look at how bloom tokenizes text
tokens = tokenizer.tokenize(text_prompt)
print(tokens)

['Albert', 'ĠEinstein', 'Ġwon', 'Ġa', 'ĠNobel', 'Ġprize', 'Ġfor', 'Who', 'Ġwas', 'ĠOtto', 'ĠWar', 'burg', '?']


In [None]:
# encode the prompts as input ids and return padded PyTorch tensors
input_ids = tokenizer(text_prompt, return_tensors="pt", padding=True)
input_ids

{'input_ids': tensor([[124190,  78426,  15974,    267,  41530, 127901,    613],
        [     3,  57647,   1620,  96624,  18692,  19616,     34]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [0, 1, 1, 1, 1, 1, 1]])}
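
In the output above, the shorter prompt is padded on the left: its first id is 3 and the matching attention-mask entry is 0, which tells the model to ignore that position. The following is a minimal pure-Python sketch of that padding step, not the tokenizer's actual implementation. `left_pad` is a hypothetical helper, and the pad id 3 is simply read off the output above.

```python
from typing import List, Tuple


def left_pad(batch: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Left-pad token-id sequences to a common length and build attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + seq)             # pads go in front of the real tokens
        attention_mask.append([0] * n_pad + [1] * len(seq))  # 0 = ignore, 1 = attend
    return input_ids, attention_mask


# Token ids copied from the output above (assumption: 3 is the pad id).
batch = [
    [124190, 78426, 15974, 267, 41530, 127901, 613],
    [57647, 1620, 96624, 18692, 19616, 34],
]
ids, mask = left_pad(batch, pad_id=3)
print(ids[1])
print(mask[1])
```

Padding on the left keeps the last real token of every prompt at the end of its row, which is where a causal model continues generating.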

In [None]:
# generate the continuations as token ids
gen_text = model.generate(**input_ids, min_length=20, max_length=40, temperature=0.2)
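
The `temperature` argument rescales the model's scores before they are turned into probabilities. The sketch below shows the effect on a made-up set of scores; `softmax_with_temperature` is a hypothetical helper written for illustration and is not a transformers function. Note that the most likely token stays the same at any temperature, so the setting only matters when tokens are sampled. As far as I know, `generate` applies it only when `do_sample=True` is passed, which the call above does not do.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits / temperature; temperature must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
probs_default = softmax_with_temperature(logits, 1.0)
probs_sharp = softmax_with_temperature(logits, 0.2)
print(probs_default)
print(probs_sharp)
```

A temperature below 1 concentrates probability on the top-scoring token, so sampled text becomes more predictable.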

In [None]:
# decode the generated ids back into text, one string per prompt

predictions = []
for index, ids in enumerate(gen_text):
  predicted_text = tokenizer.decode(ids)
  # print(f"Paragraph {index} is: {predicted_text}\n")
  predictions.append(predicted_text)

In [None]:
# clean up the text:
# 1. find the last "." and delete everything after it so the text ends on a complete sentence
# 2. remove the newline character \n to make the output easier to review
# 3. remove <pad> tokens

for index, pred in enumerate(predictions):
  pred = pred[0:pred.rfind(".")+1] # truncate everything after the last period
  pred = pred.replace("\n", "") # remove newlines
  pred = pred.replace("<pad>", "") # remove padding tokens
  print(f"Paragraph {index}: {pred}\n")

Paragraph 0: Albert Einstein won a Nobel prize for his work on the theory of relativity. He was the first person to use the word “relativity” in his Nobel lecture.

Paragraph 1: Who was Otto Warburg?"- Otto Warburg was a German chemist who was the first to study the metabolism of cancer cells.
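
The three cleaning steps can be wrapped in a reusable function. The sketch below is one possible version, with two deliberate differences from the cell above: it keeps the whole text when no period is found (the slice `pred[0:pred.rfind(".")+1]` returns an empty string in that case), and it replaces newlines with a space so that sentences do not run together. `clean_prediction` is a hypothetical name. `tokenizer.decode` also accepts `skip_special_tokens=True`, which should drop the padding at decode time instead.

```python
def clean_prediction(text: str, pad_token: str = "<pad>") -> str:
    """Strip pad tokens and newlines, then cut after the last full stop."""
    text = text.replace(pad_token, "").replace("\n", " ")
    last_period = text.rfind(".")
    if last_period != -1:          # keep the text whole if it has no period
        text = text[: last_period + 1]
    return " ".join(text.split())  # collapse repeated whitespace


print(clean_prediction("<pad><pad>First sentence.\nSecond sentence. Trailing frag"))
print(clean_prediction("no period here"))
```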



#### Using BLOOM 3B for zero-shot Q&A

In [None]:
# load the tokenizer and the model (same checkpoint as in the previous section)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-3b")

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-3b")  # the 3-billion-parameter BLOOM checkpoint

In [None]:
# several general-knowledge questions

prompt = ["Who was Otto Warburg and what he is known for?",
          "What Prof. Thomas Seyfried theory about cancer as a metabolic disease says?",
          "What is the role of the m-Tor pathway in cancer progression?",
          "What are the best natural blood Glucose inhibitors?",
          "Is the theory of cancer as a metabolic disease correct or not?"]

In [None]:
# encode the questions as padded input ids
input_ids = tokenizer(prompt, return_tensors='pt', padding=True)

In [None]:
# generate token ids; the continuations should now read as answers
generated_ids = model.generate(**input_ids, min_length=20, max_length=80, temperature=0.2, repetition_penalty=1.1)
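
The `repetition_penalty=1.1` argument discourages the model from repeating tokens it has already produced. The sketch below shows the scheme I believe transformers uses (the one introduced with the CTRL model): positive scores of already-seen tokens are divided by the penalty and negative ones are multiplied by it. `apply_repetition_penalty` and the scores are hypothetical and only illustrate the arithmetic.

```python
from typing import Iterable, List


def apply_repetition_penalty(logits: List[float], seen_ids: Iterable[int], penalty: float) -> List[float]:
    """Return a copy of logits with already-generated token ids made less likely."""
    out = list(logits)
    for tok in set(seen_ids):
        # Dividing a positive score or multiplying a negative one both lower it when penalty > 1.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


logits = [3.0, -1.0, 0.5, 2.0]  # made-up scores for a four-token vocabulary
penalized = apply_repetition_penalty(logits, seen_ids=[0, 1], penalty=1.1)
print(penalized)
```

A penalty of 1.0 leaves the scores unchanged, and larger values push harder against repetition.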

In [None]:
# decode the generated ids back to text

predictions = []
for ids in generated_ids:
  predicted_text = tokenizer.decode(ids)
  print(f"Q&A: {predicted_text}\n")
  predictions.append(predicted_text)

Q&A: <pad><pad><pad>Who was Otto Warburg and what he is known for? The German scientist who discovered the significance of glucose metabolism in cancer cells. He also developed a method to measure oxygen consumption by living organisms.
The first thing you should know about Otto Warburg is that he was born on July 25, 1875 in Vienna, Austria. His father was a physician and his mother was an artist. He

Q&A: What Prof. Thomas Seyfried theory about cancer as a metabolic disease says? What is the difference between metabolism and energy production?
The most important thing to know about metabolism is that it is an essential process in all living organisms, including humans.
Metabolism is the chemical reactions that take place inside cells to produce energy from food or other sources of fuel. Metabolism also helps us store nutrients for

Q&A: <pad>What is the role of the m-Tor pathway in cancer progression? The mTOR pathway has been implicated in a variety of cellular processes, including 

In [None]:
# clean up the text:
# 1. find the last "." and delete everything after it so the text ends on a complete sentence
# 2. remove the newline character \n to make the output easier to review
# 3. remove <pad> tokens

for index, pred in enumerate(predictions):
  pred = pred[0:pred.rfind(".")+1] # truncate everything after the last period
  pred = pred.replace("\n", "") # remove newlines
  pred = pred.replace("<pad>", "") # remove padding tokens
  print(f"Paragraph {index}: {pred}\n")

Paragraph 0: Who was Otto Warburg and what he is known for? The German scientist who discovered the significance of glucose metabolism in cancer cells. He also developed a method to measure oxygen consumption by living organisms.The first thing you should know about Otto Warburg is that he was born on July 25, 1875 in Vienna, Austria. His father was a physician and his mother was an artist.

Paragraph 1: What Prof. Thomas Seyfried theory about cancer as a metabolic disease says? What is the difference between metabolism and energy production?The most important thing to know about metabolism is that it is an essential process in all living organisms, including humans.Metabolism is the chemical reactions that take place inside cells to produce energy from food or other sources of fuel.

Paragraph 2: What is the role of the m-Tor pathway in cancer progression? The mTOR pathway has been implicated in a variety of cellular processes, including cell growth and proliferation.

Paragraph 3: Wh
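
Because a causal model continues the prompt, every decoded string starts with the question itself. If only the answer is wanted, the echoed prompt can be stripped. This is a minimal sketch, assuming the decoded text begins with the exact prompt once the padding is removed; `extract_answer` is a hypothetical helper, and the sample string is adapted from the output above.

```python
def extract_answer(decoded: str, prompt: str, pad_token: str = "<pad>") -> str:
    """Return the generated continuation without padding or the echoed prompt."""
    text = decoded.replace(pad_token, "").strip()
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


question = "What is the role of the m-Tor pathway in cancer progression?"
decoded = (
    "<pad>What is the role of the m-Tor pathway in cancer progression? "
    "The mTOR pathway has been implicated in a variety of cellular processes, "
    "including cell growth and proliferation."
)
answer = extract_answer(decoded, question)
print(answer)
```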

#### Using BLOOM 3B for general Q&A with a question-answering model class and few-shot examples

## Note: This does not work because the model is not yet trained for Q&A. In a separate notebook we will train it for Q&A.

In [None]:
# define the tokenizer and model, this time using the question-answering model class

from transformers import AutoModelForQuestionAnswering, AutoModelForDocumentQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-3b")

model = AutoModelForQuestionAnswering.from_pretrained("bigscience/bloom-3b") # this does not work: the checkpoint is not yet trained for Q&A

#### As we can see, this does not work because the model is not yet trained for Q&A. In a separate notebook we will train it for Q&A.
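
Few-shot prompting does not need a dedicated question-answering head: the solved examples can be placed directly in the prompt given to the causal model used in the earlier sections. The sketch below only builds such a prompt string. The `Question:`/`Answer:` layout, the helper name `build_few_shot_prompt`, and the example pairs are assumptions for illustration, and the result has not been tried against BLOOM 3B here.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Format solved Q/A pairs followed by an open question for a causal LM."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    blocks.append(f"Question: {question}\nAnswer:")  # left open for the model to complete
    return "\n\n".join(blocks)


examples = [
    ("What is the chemical formula of water?", "H2O"),
    ("How many days are in a leap year?", "366"),
]
few_shot_prompt = build_few_shot_prompt(examples, "Who was Otto Warburg and what is he known for?")
print(few_shot_prompt)
```

The resulting string could then be tokenized and passed to `model.generate` in the same way as the zero-shot prompts.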