##### Copyright 2024 Google LLC.

# Prompting

This notebook is based on the <a href='https://www.kaggle.com/code/markishere/day-1-prompting'>Day 1</a> material of Google's Generative AI course and on <a href='https://www.kaggle.com/whitepaper-prompt-engineering'>Google's whitepaper on Prompt Engineering</a> by Lee Boonstra.

### Install the SDK


In [1]:
!pip install -U -q "google-generativeai>=0.8.3"

In [2]:
import google.generativeai as genai
from IPython.display import HTML, Markdown, display
from env import GOOGLE_API_KEY  # the API key is kept in a local env.py module

%load_ext autoreload
%autoreload 2

### Set up API key


In [3]:
# GOOGLE_API_KEY comes from the env module imported above
genai.configure(api_key=GOOGLE_API_KEY)

### List of available models

In [4]:
# Retrieving the names of available models
for model in genai.list_models():
    print(model.name)

models/chat-bison-001
models/text-bison-001
models/embedding-gecko-001
models/gemini-1.0-pro-latest
models/gemini-1.0-pro
models/gemini-pro
models/gemini-1.0-pro-001
models/gemini-1.0-pro-vision-latest
models/gemini-pro-vision
models/gemini-1.5-pro-latest
models/gemini-1.5-pro-001
models/gemini-1.5-pro-002
models/gemini-1.5-pro
models/gemini-1.5-pro-exp-0801
models/gemini-1.5-pro-exp-0827
models/gemini-1.5-flash-latest
models/gemini-1.5-flash-001
models/gemini-1.5-flash-001-tuning
models/gemini-1.5-flash
models/gemini-1.5-flash-exp-0827
models/gemini-1.5-flash-002
models/gemini-1.5-flash-8b
models/gemini-1.5-flash-8b-001
models/gemini-1.5-flash-8b-latest
models/gemini-1.5-flash-8b-exp-0827
models/gemini-1.5-flash-8b-exp-0924
models/learnlm-1.5-pro-experimental
models/gemini-exp-1114
models/gemini-exp-1121
models/embedding-001
models/text-embedding-004
models/aqa


In [5]:
# Retrieving a detailed description of the models for comparison

import pandas as pd
import re

columns = ['name',
           'base_model_id',
           'version',
           'display_name',
           'description',
           'input_token_limit',
           'output_token_limit',
           'supported_generation_methods',
           'temperature',
           'max_temperature',
           'top_p',
           'top_k']

# Extract "key=value" pairs from the string representation of each model
pattern = r"(\w+)\s*=\s*(.*?)(?=,\s*\w+=|\)$)"

rows = []
for model in genai.list_models():
    model_text = str(model)
    rows.append({match[0]: match[1].strip() for match in re.findall(pattern, model_text)})

# Build the DataFrame once instead of concatenating inside the loop
model_df = pd.DataFrame(rows, columns=columns)
model_df.drop(columns=['base_model_id'], inplace=True)  # dropping as the values are null
model_df

| | name | version | display_name | description | input_token_limit | output_token_limit | supported_generation_methods | temperature | max_temperature | top_p | top_k |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 'models/chat-bison-001' | '001' | 'PaLM 2 Chat (Legacy)' | 'A legacy text-only model optimized for chat c... | 4096 | 1024 | ['generateMessage', 'countMessageTokens'] | 0.25 | | 0.95 | 40.0 |
| 1 | 'models/text-bison-001' | '001' | 'PaLM 2 (Legacy)' | 'A legacy model that understands text and gene... | 8196 | 1024 | ['generateText', 'countTextTokens', 'createTun... | 0.7 | | 0.95 | 40.0 |
| 2 | 'models/embedding-gecko-001' | '001' | 'Embedding Gecko' | 'Obtain a distributed representation of a text.' | 1024 | 1 | ['embedText', 'countTextTokens'] | | | | |
| 3 | 'models/gemini-1.0-pro-latest' | '001' | 'Gemini 1.0 Pro Latest' | | 30720 | 2048 | ['generateContent', 'countTokens'] | 0.9 | | 1.0 | |
| 4 | 'models/gemini-1.0-pro' | '001' | 'Gemini 1.0 Pro' | 'The best model for scaling across a wide rang... | 30720 | 2048 | ['generateContent', 'countTokens'] | 0.9 | | 1.0 | |
| 5 | 'models/gemini-pro' | '001' | 'Gemini 1.0 Pro' | 'The best model for scaling across a wide rang... | 30720 | 2048 | ['generateContent', 'countTokens'] | 0.9 | | 1.0 | |
| 6 | 'models/gemini-1.0-pro-001' | '001' | 'Gemini 1.0 Pro 001 (Tuning)' | | 30720 | 2048 | ['generateContent', 'countTokens', 'createTune... | 0.9 | | 1.0 | |
| 7 | 'models/gemini-1.0-pro-vision-latest' | '001' | 'Gemini 1.0 Pro Vision' | | 12288 | 4096 | ['generateContent', 'countTokens'] | 0.4 | | 1.0 | 32.0 |
| 8 | 'models/gemini-pro-vision' | '001' | 'Gemini 1.0 Pro Vision' | | 12288 | 4096 | ['generateContent', 'countTokens'] | 0.4 | | 1.0 | 32.0 |
| 9 | 'models/gemini-1.5-pro-latest' | '001' | 'Gemini 1.5 Pro Latest' | | 2000000 | 8192 | ['generateContent', 'countTokens'] | 1.0 | 2.0 | 0.95 | 40.0 |


More detailed characteristics of the models can be found <a href='https://ai.google.dev/gemini-api/docs/models/gemini#gemini-1.5-flash'>here</a>.

### Output text generation

The examples below use the `gemini-1.5-flash` model.

In [6]:
flash_model = genai.GenerativeModel('gemini-1.5-flash')

In [7]:
# Text generation: stream the response and print each partial chunk followed by "...",
# then render the complete text as Markdown
response = flash_model.generate_content("Explain how nuclear reactor works, without being too much technical",
                                        stream=True)
for chunk in response:
    print(chunk.text, end=" ")
    print("...", end = '')
Markdown(response.text)

Imagine ... a giant, controlled bonfire.  Instead of wood, a nuclear reactor uses uranium ..., a special element that releases incredible amounts of heat when its atoms split apart ( ...this is called nuclear fission).

This heat boils water, creating high-pressure steam.  That steam spins giant turbines, which are connected to generators.   ...These generators produce electricity, just like a power plant using coal or gas, but with a much more powerful "bonfire."

To keep the "bonfire ..." under control, the reactor uses control rods.  These rods absorb some of the neutrons that cause the uranium to split, slowing down the reaction and preventing it from becoming too intense (a meltdown).  The entire process is carefully monitored and ... managed by computers and technicians to ensure safety and efficient power generation.

The used uranium fuel is highly radioactive and needs to be carefully stored for many years to allow it to become less dangerous.
 ...

Imagine a giant, controlled bonfire.  Instead of wood, a nuclear reactor uses uranium, a special element that releases incredible amounts of heat when its atoms split apart (this is called nuclear fission).

This heat boils water, creating high-pressure steam.  That steam spins giant turbines, which are connected to generators.  These generators produce electricity, just like a power plant using coal or gas, but with a much more powerful "bonfire."

To keep the "bonfire" under control, the reactor uses control rods.  These rods absorb some of the neutrons that cause the uranium to split, slowing down the reaction and preventing it from becoming too intense (a meltdown).  The entire process is carefully monitored and managed by computers and technicians to ensure safety and efficient power generation.

The used uranium fuel is highly radioactive and needs to be carefully stored for many years to allow it to become less dangerous.


In [8]:
# Chat generation
chat = flash_model.start_chat(history=[
    {'role': 'user' , 'parts' : "Try to guess a fruit, in two questions"}
])
response = chat.send_message("I'll start with a hint : it's a tropical fruit")
print(response.text)

Is it round?



In [9]:
response = chat.send_message('Yes')
print(response.text)

Is it primarily yellow or orange when ripe?



In [10]:
response = chat.send_message('Yes')
print(response.text)

My guess is a mango.



In [11]:
response = chat.send_message('Yes')

In [12]:
# Show chat history
chat.history

[parts {
   text: "Try to guess a fruit, in two questions"
 }
 role: "user",
 parts {
   text: "I\'ll start with a hint : it\'s a tropical fruit"
 }
 role: "user",
 parts {
   text: "Is it round?\n"
 }
 role: "model",
 parts {
   text: "Yes"
 }
 role: "user",
 parts {
   text: "Is it primarily yellow or orange when ripe?\n"
 }
 role: "model",
 parts {
   text: "Yes"
 }
 role: "user",
 parts {
   text: "My guess is a mango.\n"
 }
 role: "model",
 parts {
   text: "Yes"
 }
 role: "user",
 parts {
   text: "Great!  I guessed it.\n"
 }
 role: "model"]

## Explore generation parameters



### Output length

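Output length can be controlled with the `max_output_tokens` parameter and with prompt engineering, for example by asking for a specific length in the prompt. Note that one token is roughly 4 characters, so a token limit is not a word count. The cell in this section shows the effect: the 200-token limit cuts the requested 198-word essay off mid-word.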
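To make the "about 4 characters per token" rule of thumb concrete, here is a minimal sketch of a character-based estimate. The helper name `estimate_tokens` and the `chars_per_token` default are assumptions for illustration; this is a heuristic, not a tokenizer, and the exact count for a given model should come from the SDK's token-counting call rather than from this estimate.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from the character count (heuristic only)."""
    return math.ceil(len(text) / chars_per_token)


prompt = "Write a 198 word essay on the importance of chewing gum in modern society."
# Useful for a quick sanity check before choosing max_output_tokens.
print(estimate_tokens(prompt))
```

An estimate like this helps explain why a word target and a token limit with similar numbers do not line up: English words are usually longer than a single token.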

In [13]:
short_model = genai.GenerativeModel(
    'gemini-1.5-flash',
    generation_config=genai.GenerationConfig(max_output_tokens=200))

response = short_model.generate_content('Write a 198 word essay on the importance of chewing gum in modern society.')
print(response.text)

While seemingly trivial, chewing gum holds a surprisingly significant role in modern society, extending beyond simple oral hygiene.  Its function as a stress reliever is widely acknowledged; the repetitive act provides a distraction and outlet for nervous energy, particularly beneficial in high-pressure environments.  This contributes to improved focus and reduced anxiety, indirectly boosting productivity and well-being.

Furthermore, chewing gum stimulates saliva production, enhancing oral health by neutralizing acids and washing away food particles.  This contributes to the prevention of cavities and gum disease, reducing the need for extensive and costly dental treatments.  The resulting fresher breath also has positive social implications, fostering confidence in social interactions.

Beyond its functional benefits, the chewing gum industry contributes significantly to the economy, supporting numerous jobs in manufacturing, distribution, and marketing.  The diverse flavors and inno

### Temperature, top-k, and top-p

The `temperature` parameter controls the degree of randomness in token selection.

For the `gemini-1.5-flash` model, the default value is 1.0 and the maximum value is 2.0.

A higher temperature flattens the probability distribution over the candidate tokens, so less likely tokens are picked more often and the results become more open-ended and creative. A lower temperature concentrates the probability on the most likely tokens. A temperature of 0 corresponds to greedy decoding: the token with the highest probability is always selected.

Note that temperature doesn't provide any guarantees of randomness, but it can be used to somewhat "nudge" the output.

`top-K` sets how many of the most probable tokens the model considers. The default value is 40. A value of 1 means always choosing the most probable token (greedy decoding).

`top-P` sets a threshold on the cumulative probability of the candidate tokens, accumulated from the most to the least probable. The default value is 0.95. A higher value means more random responses.

The order of execution in the Gemini API is `top-K` -> `top-P` -> `temperature`.
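As a rough illustration of the `top-K` -> `top-P` -> `temperature` order, here is a minimal, self-contained sketch of next-token selection. It is a simplified teaching example, not the Gemini API's actual implementation; the function `sample_token`, the `logits` dictionary and its scores are made up for illustration.

```python
import math
import random


def sample_token(logits, top_k=40, top_p=0.95, temperature=1.0, rng=random):
    """Pick one token: top-K filter, then top-P filter, then temperature sampling."""
    # Softmax over the raw scores (shifted by the max for numerical stability).
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # 1. top-K: keep only the K most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # 2. top-P: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # 3. temperature: 0 means greedy decoding; otherwise rescale and sample.
    if temperature == 0:
        return kept[0][0]
    # p ** (1 / T) is equivalent to applying softmax to logits / T on the kept tokens.
    weights = [p ** (1.0 / temperature) for _, p in kept]
    return rng.choices([tok for tok, _ in kept], weights=weights, k=1)[0]


logits = {"the": 2.0, "a": 1.0, "cat": 0.5, "zebra": -1.0}  # toy scores
print(sample_token(logits, top_k=1))        # always "the"
print(sample_token(logits, temperature=0))  # always "the"
print(sample_token(logits, top_k=40, top_p=1.0, temperature=2.0))  # varies per run
```

With `top_k=1` or `temperature=0` only the most probable token can be returned, which matches the greedy decoding behaviour described above; widening `top_k` and `top_p` and raising `temperature` lets less likely tokens through.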

Now let's get random

In [14]:
from google.api_core import retry

# When running lots of queries, it's a good practice to use a retry policy so your code
# automatically retries when hitting Resource Exhausted (quota limit) errors.
retry_policy = {
    "retry": retry.Retry(predicate=retry.if_transient_error, initial=10, multiplier=1.5, timeout=300)
}

instruction = """Continue the story and add two sentences. New line for each sentence :
                It was a bright cold day in April, and the clocks were striking thirteen."""

print("Opening line from the book 1984 by George Orwell:\n"
      "It was a bright cold day in April, and the clocks were striking thirteen.\n"
      f"{'-' * 25}")

params = [['greedy', 1, 0.1, 0], # name, top-K, top-P, temperature
          ['default', 40, 0.95, 1],
          ['random', 5000, 1, 2]]

for name, top_k, top_p, temp in params:
    model = genai.GenerativeModel('gemini-1.5-flash',
                                  generation_config=genai.GenerationConfig(temperature=temp,
                                                                           top_p=top_p,
                                                                           top_k=top_k,
                                                                           max_output_tokens=200))
    print(f'{name.capitalize()} model : top-K {top_k}, top_p {top_p}, temperature {temp}\n')
    response = model.generate_content(instruction, request_options=retry_policy)

    if response.parts:
        print(response.text, '-' * 25)

Opening line from the book 1984 by George Orwell:
It was a bright cold day in April, and the clocks were striking thirteen.
-------------------------
Greedy model : top-K 1, top_p 0.1, temperature 0

Winston Smith, his chin nuzzled into his threadbare collar, hurried through the streets of London.  
A telescreen's watchful eye seemed to follow him from every shadowed alleyway.
 -------------------------
Default model : top-K 40, top_p 0.95, temperature 1

Winston Smith, his chin nuzzled deep in his threadbare collar, hurried through the glass doors of the Ministry of Truth.  
A telescreen, its ever-watchful eye gleaming, seemed to follow his every move.
 -------------------------
Random model : top-K 5000, top_p 1, temperature 2

Winston Smith, his chin nuzzled into his threadbare collar, hurried through the streets.


The telescreen's unsettling gaze seemed to follow him, a constant, vigilant intrusion.
 -------------------------


It is interesting that the first sentence begins with nearly the same phrase, "Winston Smith, his chin nuzzled into his threadbare collar, hurried", no matter the value of the parameters. The sentences that follow also show that the model recognises the excerpt from 1984.

The actual sentence in the book is: <i>Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.</i>

## Prompting techniques

- Zero-shot (general prompting): Provide an input directly (e.g. a question or an instruction) <i>without examples</i>.

- One-shot: Provide one example that the model can imitate to complete the task.

- Few-shot: Provide multiple examples to show the model a pattern to follow. Compared to one-shot, this approach increases the chance that the model follows the pattern. A rule of thumb is to provide at least three to five examples.
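The difference between zero-shot, one-shot and few-shot prompting comes down to how many worked examples are placed in the prompt. The sketch below assembles such a prompt as plain text; the helper `build_few_shot_prompt`, the `Input:`/`Output:` layout and the sentiment examples are illustrative assumptions, not a format required by the API.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new input into one prompt."""
    lines = [instruction, ""]
    for example_input, example_output in examples:
        lines += [f"Input: {example_input}", f"Output: {example_output}", ""]
    # The prompt ends with an open "Output:" for the model to complete.
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


# Three made-up examples, in line with the "three to five" rule of thumb.
examples = [
    ("I loved this film.", "POSITIVE"),
    ("The plot made no sense.", "NEGATIVE"),
    ("It was fine, nothing special.", "NEUTRAL"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as POSITIVE, NEUTRAL or NEGATIVE.",
    examples,
    "Great acting, terrible ending.",
)
print(prompt)
```

Passing an empty `examples` list yields a zero-shot prompt and a single pair yields a one-shot prompt, so the same helper covers all three cases. The resulting string can be sent with `generate_content` like any other prompt in this notebook.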

Different prompting approaches:

- System prompting: Set the overall context and purpose. Define the model's fundamental capabilities and overarching purpose (this also helps to avoid hallucination).

- Contextual prompting: Provide specific details or background relevant to the task.

- Role prompting: Assign a specific character or identity. Frame the model's output style and voice.

### Chain of Thought (CoT)

Chain of Thought is a prompting technique that improves the model's reasoning capabilities by having it generate intermediate reasoning steps. As a result, the model tends to produce more accurate answers that are less prone to hallucination.

It typically gets better results, especially when combined with few-shot examples, but it also tends to cost more to run, due to the increased token count.

Building on the concept of CoT, Tree of Thoughts (ToT) explores multiple different reasoning paths simultaneously to obtain the output.

### Self-consistency

While CoT typically uses a greedy decoding strategy, self-consistency prompting improves reasoning and accuracy by generating multiple responses to the same prompt, then aggregating the responses and selecting the most consistent outcome.
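The aggregation step can be as simple as a majority vote over the final answers. The sketch below assumes the final answer has already been extracted from each sampled response (for example, from several calls made with a temperature above 0); the function name `most_consistent_answer` and the sample answers are hypothetical.

```python
from collections import Counter


def most_consistent_answer(answers):
    """Return the most frequent answer and its share of the votes."""
    # Light normalization so "42" and "42 " count as the same answer.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        raise ValueError("no answers to aggregate")
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


sampled = ["42", "42 ", "41", "42", "17"]  # made-up final answers from 5 samples
print(most_consistent_answer(sampled))  # ('42', 0.6)
```

The vote share gives a rough sense of how much the sampled reasoning paths agree; a low share suggests the answer is unstable and may deserve a closer look.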

### ReAct (reason & act)

ReAct combines external tools with the model's reasoning, allowing the model to perform actions, including interacting with external APIs to retrieve information. ReAct prompting works by combining reasoning and acting into a thought-action loop.
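The thought-action loop can be sketched without any external service. In the example below, `react_loop`, the step dictionary format, `scripted_model` and the `lookup` tool with its toy inventory are all hypothetical stand-ins chosen so the code runs offline; in practice the model step would be an LLM call and the tools would wrap real APIs.

```python
def react_loop(model_step, tools, question, max_steps=5):
    """Alternate model thoughts/actions with tool observations until an answer."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = model_step(transcript)  # assumed to return thought/action/input
        transcript += f"\nThought: {step['thought']}"
        if step["action"] == "finish":
            return step["input"], transcript
        # Run the requested tool and feed its result back as an observation.
        observation = tools[step["action"]](step["input"])
        transcript += (f"\nAction: {step['action']}[{step['input']}]"
                       f"\nObservation: {observation}")
    return None, transcript  # gave up after max_steps


def scripted_model(transcript):
    """Stand-in for an LLM: look something up once, then answer."""
    if "Observation:" not in transcript:
        return {"thought": "I need the stock level.", "action": "lookup", "input": "apples"}
    latest = transcript.rsplit("Observation: ", 1)[1]
    return {"thought": "I have what I need.", "action": "finish",
            "input": f"There are {latest} apples."}


inventory = {"apples": 12}  # toy data standing in for an external API
tools = {"lookup": lambda item: inventory.get(item, "unknown")}

answer, transcript = react_loop(scripted_model, tools, "How many apples are in stock?")
print(transcript)
print(answer)  # There are 12 apples.
```

The `max_steps` cap is a simple safeguard so that a model which never emits a final answer cannot loop forever.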

### Automatic Prompt Engineering

Automatic Prompt Engineering prompts a model to generate more prompts, evaluates them, and repeats the process.

## Suggested template for documenting prompts
- Name
- Goal
- Model
- Temperature
- Token limit
- Top-K
- Top-P
- Prompt
- Output
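One possible way to keep such records next to the code is a small dataclass with one field per item in the template. The class name `PromptRecord`, the field names and the example values below are assumptions for illustration; a spreadsheet or a plain dictionary would serve the same purpose.

```python
from dataclasses import asdict, dataclass


@dataclass
class PromptRecord:
    """One documented prompt attempt, mirroring the template fields."""
    name: str
    goal: str
    model: str
    temperature: float
    token_limit: int
    top_k: int
    top_p: float
    prompt: str
    output: str = ""  # filled in after the model has responded


record = PromptRecord(
    name="story_continuation_default",
    goal="Continue an opening line with two sentences",
    model="gemini-1.5-flash",
    temperature=1.0,
    token_limit=200,
    top_k=40,
    top_p=0.95,
    prompt="Continue the story and add two sentences.",
)
print(asdict(record))
```

Converting records with `asdict` makes it easy to collect many attempts into a table and compare how parameter changes affect the output.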