# Google Gemini Notes

## Models

* __gemini-pro__: Optimized for high intelligence tasks, the most powerful Gemini model
* __gemini-flash__: Optimized for multi-modal use-cases, where speed and cost are important
* __text-embedding__: Generates text embeddings.
* __aqa__: Perform Attributed Question-Answering (AQA)–related tasks over a document, corpus, or a set of passages. The AQA model returns answers to questions that are grounded in provided sources, along with estimating answerable probability.

[Ref](https://ai.google.dev/gemini-api/docs/models/gemini)

## Generating Text

Use the flash model:

```python
import google.generativeai as genai

# Assumes genai.configure(api_key=...) has already been called
model = genai.GenerativeModel('gemini-1.5-flash')

# generate_content handles various use cases, including multimodal input
response = model.generate_content("What is the meaning of life?")

# Responses are given in response.text
# to_markdown is assumed to be a notebook helper defined elsewhere that renders the output as markdown
to_markdown(response.text)

# You can use response.prompt_feedback to understand why there was no response (e.g. there may be safety concerns)
response.prompt_feedback

# You can view multiple possible responses with response.candidates
response.candidates

# Responses can also be streamed, instead of waiting for the whole thing to be generated at once.
```

## Generating Text From Images and Text Inputs

```python

import PIL.Image

img = PIL.Image.open('image.jpg')


model = genai.GenerativeModel('gemini-1.5-flash')

response = model.generate_content(img)

to_markdown(response.text)

```


> This image shows two glass containers filled with prepared food...

You can also pass in a list of strings and images:

```python
response = model.generate_content(["Write a short, engaging blog post based on this picture. It should include a description of the meal in the photo and talk about my journey meal prepping.", img], stream=True)

response.resolve()

to_markdown(response.text)
```

> Meal prepping is a great way to save time and money, and it can also help you to eat healthier. 

## Chat Conversations

You can use the `ChatSession` class to manage conversation state.

```python
model = genai.GenerativeModel('gemini-1.5-flash')
chat = model.start_chat(history=[])
chat
```

```
ChatSession(
    model=genai.GenerativeModel(
        model_name='models/gemini-1.5-flash',
        generation_config={},
        safety_settings={},
        tools=None,
        system_instruction=None,
        cached_content=None
    ),
    history=[]
)
```

History can then be stored and retrieved.

```python
response = chat.send_message("In one sentence, explain how a computer works to a young child.")
to_markdown(response.text)

chat.history
```

```
[parts {
   text: "In one sentence, explain how a computer works to a young child."
 }
 role: "user",
 parts {
   text: "A computer is like a very smart machine that can understand and follow our instructions, help us with our work, and even play games with us!"
 }
 role: "model"]
```

You can iterate over the history like this:

```python
for message in chat.history:
  display(to_markdown(f'**{message.role}**: {message.parts[0].text}'))
```

```


    user: In one sentence, explain how a computer works to a young child.

    model: A computer is like a very smart machine that can understand and follow our instructions, help us with our work, and even play games with us!

...
```

## Counting Tokens

Large language models have a context window, and the context length is often measured in terms of the number of tokens.

```python
model.count_tokens("What is the meaning of life?")
```

```
> total_tokens: 7
```

A token is equivalent to about 4 characters for Gemini models. 100 tokens are about 60-80 English words.
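
The 4-characters-per-token rule of thumb can be turned into a quick offline estimate. This is a minimal sketch: `estimate_tokens` is a hypothetical helper, not part of the SDK, and `model.count_tokens` remains the authoritative count.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb from these notes, not an exact tokenizer

def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

print(estimate_tokens("What is the meaning of life?"))
```

For this 28-character prompt the heuristic gives 7, which happens to match the `count_tokens` result above; for other text the two can differ.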

## Using Embeddings

An embedding represents text as a vector (a list of floats) so that texts can be compared numerically. Texts that have similar subject matter or sentiment should have similar embeddings when compared using e.g. cosine similarity.

```python
result = genai.embed_content(
    model="models/text-embedding-004",
    content=[
      'What is the meaning of life?',
      'How much wood would a woodchuck chuck?',
      'How does the brain work?'],
    task_type="retrieval_document",
    title="Embedding of list of strings")

# A list of inputs > A list of vectors output
for v in result['embedding']:
  print(str(v)[:50], '... TRIMMED ...')
```

```
> [0.0040260437, 0.004124458, -0.014209415, -0.00183 ... TRIMMED ...
```
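
To compare two of the returned vectors, cosine similarity can be computed by hand. The sketch below uses made-up 3-dimensional vectors in place of real embeddings, so the numbers only illustrate the mechanics.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (values are invented)
life = [0.9, 0.1, 0.0]
brain = [0.8, 0.3, 0.1]
woodchuck = [0.0, 0.2, 0.9]

# Vectors pointing in similar directions score closer to 1
print(cosine_similarity(life, brain) > cosine_similarity(life, woodchuck))
```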

Depending on what the text is being used for, you can set different task types:

Task Type | Description
---       | ---
RETRIEVAL_QUERY | Specifies the given text is a query in a search/retrieval setting.
RETRIEVAL_DOCUMENT | Specifies the given text is a document in a search/retrieval setting. Using this task type requires a `title`.
SEMANTIC_SIMILARITY | Specifies the given text will be used for Semantic Textual Similarity (STS).
CLASSIFICATION | Specifies that the embeddings will be used for classification.
CLUSTERING | Specifies that the embeddings will be used for clustering.

## Safety Settings

You can set safety settings to block potentially risky prompts. The following safety filters are available:

* Harassment
* Hate speech
* Sexually explicit
* Dangerous

Each of these categories is rated with a harm probability of `HIGH` / `MEDIUM` / `LOW` / `NEGLIGIBLE`, and the API can be set to block each category at each of these levels.
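
A sketch of what a per-category configuration could look like. The category and threshold strings below are my understanding of the names the Gemini API uses and are not taken from these notes, so treat them as assumptions and check the current docs; `build_safety_settings` is a hypothetical helper.

```python
# Assumed category names (not from the notes above)
CATEGORIES = [
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
]

def build_safety_settings(threshold="BLOCK_MEDIUM_AND_ABOVE"):
    """Return one {category, threshold} entry per harm category."""
    return [{"category": c, "threshold": threshold} for c in CATEGORIES]

safety_settings = build_safety_settings("BLOCK_ONLY_HIGH")
print(safety_settings[0])

# Assumed usage with the SDK (not executed here):
# model = genai.GenerativeModel('gemini-1.5-flash', safety_settings=safety_settings)
```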

## Customizable Parameters


Parameter | Description
---       | ---
Top p (probability) | Controls the randomness or focus of the generated text. The model samples the next token only from the smallest set of most likely tokens whose cumulative probability reaches p. Higher Top p (closer to 1): More low-probability tokens stay eligible, leading to more creative and surprising but potentially less relevant outputs. Lower Top p (closer to 0): Only the most likely continuations are considered, resulting in more predictable and relevant but potentially less creative outputs.
Top k (number) | Limits the number of possible continuations considered by the model when generating the next token. It acts as a filter, reducing the search space to the k most likely tokens. Higher Top k: The model considers a wider range of possibilities, potentially leading to more diverse and interesting outputs. Lower Top k: The model focuses on a smaller set of highly likely continuations, resulting in more consistent and focused outputs.
Temperature | Like Top p, controls the randomness of the generated text. However, it works by scaling the logits (unnormalized log probabilities) of the candidate tokens before the next one is selected. Higher Temperature (greater than 1): Increases the randomness, making the model more likely to choose less probable but potentially more creative continuations. Lower Temperature (between 0 and 1): Decreases the randomness, favoring the most likely continuations and leading to more predictable outputs. Temperature of 1: Leaves the original probability distribution unchanged.
Stop Sequence (string) | Specifies a string or sequence of characters that signals the end of the text generation. Once the model encounters this sequence, it stops generating further text. This is useful for controlling the length and focus of the generated content. For example, you might set the stop sequence to a specific punctuation mark (".", "?", "!") to indicate the end of a sentence or paragraph.
Max Output Length (number) | Sets a hard limit on the maximum number of tokens (words or subwords) the model can generate. This helps prevent overly long or rambling outputs. It's useful when you need the generated text to be concise or fit within a specific length.
Number of Response Candidates (number) | Sets how many complete candidate responses are returned for a prompt (see `response.candidates`). Potentially specific to certain models or use cases.
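
To make Top k and Top p concrete, here is a toy filter over a next-token distribution. It is a sketch of the general technique, not Gemini's actual decoder. The function name and the `"moon"` entry are made up, and the other probabilities are borrowed from the "dog jumped over the..." example in these notes.

```python
def top_k_top_p_filter(probs, top_k=None, top_p=None):
    """Keep the most likely tokens allowed by top_k and top_p, then renormalize."""
    # Sort candidate tokens from most to least likely
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens

    if top_p is not None:
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:  # smallest set reaching probability mass p
                break
        ranked = kept

    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}

probs = {"fence": 0.77, "ledge": 0.12, "blanket": 0.03, "moon": 0.08}
print(top_k_top_p_filter(probs, top_k=2))    # keeps fence, ledge
print(top_k_top_p_filter(probs, top_p=0.9))  # keeps fence, ledge, moon
```

The next token would then be sampled from the filtered, renormalized distribution.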

## System Instructions

When you initialize an AI model, you can give it instructions on how to respond, such as setting a persona ("you are a rocket scientist") or telling it what kind of voice to use ("talk like a pirate"). 

You do this by setting the system instructions when you initialize the model.

Example system instruction use cases:

* Define a persona or role
* Define output format
* Define goals or rules
* Provide additional context (e.g. a knowledge date cutoff)

Once set, the instructions persist through all interactions with the model, across multiple user and model turns.

Example:

```python
instruction = (
    "You are a coding expert that specializes in front end interfaces. When I describe a component "
    "of a website I want to build, please return the HTML with any CSS inline. Do not give an "
    "explanation for this code."
)

model = genai.GenerativeModel(
    "models/gemini-1.5-flash", system_instruction=instruction
)

prompt = (
    "A flexbox with a large text logo aligned left and a list of links aligned right."
)

response = model.generate_content(prompt)
print(response.text)

from IPython.display import HTML

# Render the HTML
HTML(response.text.strip().removeprefix("```html").removesuffix("```"))
```

## Text Embeddings

Text embeddings convert text into coordinates (vectors) that can be plotted in n-dimensional space. This allows text to be treated as relational data, which we can train models on.

Embeddings capture semantic meaning and context, so similar sentences should have embeddings that are close to each other.

### Use Cases

#### Information Retrieval

Use embeddings to retrieve semantically similar text given some input text. E.g. semantic search, question answering, or summarization.

#### Classification

Train a model to classify documents into categories. For example, classify user comments as negative or positive or classify forum posts into categories.

#### Clustering

Train a model to cluster text together. For example, cluster forum posts in a mailing list.

#### Document Search

Create document embeddings and then send queries to find the text that contains the most relevant answer.
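
A minimal sketch of that lookup: score every document vector against the query vector and keep the best matches. The vectors and document ids are invented; real ones would come from an embedding call such as `genai.embed_content`, and a vector DB would replace the linear scan.

```python
import math

def search(query_vec, doc_vecs, top_n=2):
    """Return the top_n (doc_id, cosine score) pairs, most similar first."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Made-up 3-d vectors standing in for document embeddings
doc_vecs = {
    "pasta-recipe": [0.9, 0.1, 0.0],
    "router-manual": [0.0, 0.9, 0.2],
    "pizza-recipe": [0.8, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # stands in for an embedded cooking question

for doc_id, score in search(query_vec, doc_vecs):
    print(f"{doc_id}: {score:.2f}")
```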

### Vector database

You can store generated embeddings in a vector DB (e.g. AlloyDB) to improve the accuracy and efficiency of retrieval.

## Context Caching

At times, you might want to send the same input tokens to the model over and over again. E.g.

* Chatbots with extensive system instructions
* Repetitive analysis of large video files
* Recurring queries against large document sets
* Frequent code repo analysis

To save money, instead of sending these over and over, you can cache them. Cost is based on size and cache TTL.

## Tokens

LLMs break up input and produce output using tokens. Tokens could be a single character, or a whole word. A large word might be broken into several tokens.

In Gemini, a token is about 4 characters.

The price of a request is determined by the number of input and output tokens.

### Context Window

The total amount of input and output the model can handle in one request is known as the `context window`.

### Multi-modal Tokens

* __Images__: Images are converted internally to a fixed size, so each image uses the same number of tokens
* __Audio/Video__: Audio/video are converted to tokens at a fixed rate of tokens per minute.



## Prompting

Prompts can contain:

* Input (req)
* Context (opt)
* Examples (opt)

Input can be further broken down into:

* Question
* Task ("Give me a list of...")
* Entity ("Classify / summarize...")
* Completion ("Some strategies to deal with writer's block include...")

### Prompting With Media

You can point the API directly at small local files, or upload files to the API for free.

You can upload image, audio, video, and plain text files (including Python, CSV, JSON, etc).

### Prompt Design Strategies

* Make instructions clear and specific
* Specify constraints (e.g. summarize in __two sentences__)
* Define response format (bulleted, table, etc)
* Provide few-shot examples (a few examples = few-shot, no examples = zero-shot)
* If you want a concise response, you can give examples that show a preference for the concise answer. E.g.

```
Question: Why is sky blue?
Explanation1: The sky appears blue because of Rayleigh scattering, which causes shorter blue
wavelengths of light to be scattered more easily than longer red wavelengths, making the sky look
blue.
Explanation2: Due to Rayleigh scattering effect.
Answer: Explanation2
```

* Find the optimal number of examples. Providing too many examples may cause the model to overfit.
* Add contextual information. For example, if prompting to troubleshoot a router, you could include the relevant text from the router's manual and add, "respond using only the text provided"
* Add prefixes
  * Input prefixes: E.g., `English:`, `French:`...
  * Output prefixes: E.g. `JSON:` signals that the answer should be in JSON, or "The answer is: ..."
  * Example prefix
* Let the model complete partial input
  * LLMs work like advanced autocomplete. Given partial input, the model can complete it, taking into account any examples or context you provide.
  
_prompt:_
```
For the given order, return a JSON object that has the fields cheeseburger, hamburger, fries, or
drink, with the value being the quantity.

Order: A burger and a drink.
```

_response:_
```
{
"cheeseburger": 0,
"hamburger": 1,
"fries": 0,
"drink": 1
}
```

Giving the model an example response that lists only the ordered items would instead lead it to leave out `cheeseburger` and `fries`.

* Prompt the model to format its response

_prompt:_
```
Create an outline for an essay about hummingbirds.
I. Introduction
*
```

_response:_
```
I. Introduction
* Capture the reader's attention with an interesting anecdote or fact about hummingbirds.
* Provide a brief background on hummingbirds, including their unique characteristics.
...
```

* Experiment with parameters (see earlier section)
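
Several of these strategies (few-shot examples, input/output prefixes, partial input for the model to complete) can be combined when building the prompt string. A minimal sketch; `build_few_shot_prompt`, the prefixes, and the example order are all hypothetical.

```python
def build_few_shot_prompt(instruction, examples, new_input,
                          input_prefix="Order:", output_prefix="JSON:"):
    """Assemble instruction + worked examples + the new input, ending on the output prefix."""
    lines = [instruction, ""]
    for example_input, example_output in examples:
        lines += [f"{input_prefix} {example_input}", f"{output_prefix} {example_output}", ""]
    # End on a bare output prefix so the model completes the answer in the same format
    lines += [f"{input_prefix} {new_input}", output_prefix]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Return a JSON object with only the items that were ordered.",
    [("Two fries and a drink.", '{"fries": 2, "drink": 1}')],
    "A burger and a drink.",
)
print(prompt)
```

The resulting string can be passed to `model.generate_content(prompt)` like any other prompt.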

### Breaking Down Prompts

For complex prompts, you can try to break them down.

#### Break down instructions

Instead of having many instructions in one prompt, create one prompt per instruction, and choose which one to process based on user input.

#### Chain prompts

For complex tasks with sequential steps, make each step a prompt and chain them together, where the output of one becomes the input of the next.
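
A sketch of such a chain, assuming each step is a prompt template with an `{input}` placeholder. `run_chain` and `fake_generate` are hypothetical; the fake stands in for a real model call so the example runs offline.

```python
def run_chain(steps, generate, initial_input):
    """Feed each step's output into the next step's prompt template."""
    text = initial_input
    for template in steps:
        text = generate(template.format(input=text))
    return text

def fake_generate(prompt):
    # Stand-in for a model call; it just wraps the prompt it received
    return f"<{prompt}>"

steps = ["Summarize: {input}", "Translate to French: {input}"]
print(run_chain(steps, fake_generate, "a long article"))
```

With a real model you would presumably pass something like `lambda p: model.generate_content(p).text` as `generate`.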

### Prompt Iteration

It can be necessary to iterate on prompts to find the right one.

* Use different phrasing
* Switch to an analogous task (e.g., instead of "which category..." try "multiple choice problem:...")
* Change the order of prompt content (e.g. instead of examples, context, input -- try: input, examples, context)

### File Prompting

Example of using images or other multimodal input:

* Write a blog post based on this image
* Get the schedule times in JSON from an image of a train platform signboard
* Solve math problem
* Put the table into markdown format
* Work out the ingredients in this dish
* Get information from product packaging, like rating or number of items in the box

## Semantic Retrieval (RAG)

RAG (retrieval-augmented generation) can be used to augment prompts sent to the LLM with data retrieved through an IR (information retrieval) system.

The knowledge base can be your own corpora of docs, a DB, or APIs.

We can improve an LLM's responses by augmenting it with the Semantic Retriever and AQA (Attributed Question-Answering) Gemini APIs.

The Semantic Retriever API lets you define up to 5 custom text corpora per project.

```python
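import google.ai.generativelanguage as glm

# Assumes retriever_service_client = glm.RetrieverServiceClient(...) was created earlier with valid credentials
example_corpus = glm.Corpus(display_name="Google for Developers Blog")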
create_corpus_request = glm.CreateCorpusRequest(corpus=example_corpus)

# Make the request
create_corpus_response = retriever_service_client.create_corpus(create_corpus_request)

# Set the `corpus_resource_name` for subsequent sections.
corpus_resource_name = create_corpus_response.name
print(create_corpus_response)
```

```
name: "corpora/google-for-developers-blog-slfs22wtfhj8"
display_name: "Google for Developers Blog"
create_time {
  seconds: 1721076123
  nanos: 201645000
}
update_time {
  seconds: 1721076123
  nanos: 201645000
}
```

* Then we add `Document`s to a corpus. Documents can also have custom metadata, such as URLs.

```python
# Create a document with a custom display name.
example_document = glm.Document(display_name="Introducing Project IDX, An Experiment to Improve Full-stack, Multiplatform App Development")

# Add metadata.
# Metadata also supports numeric values not specified here
document_metadata = [
    glm.CustomMetadata(key="url", string_value="https://developers.googleblog.com/2023/08/introducing-project-idx-experiment-to-improve-full-stack-multiplatform-app-development.html")]
example_document.custom_metadata.extend(document_metadata)

# Make the request
# corpus_resource_name is a variable set in the "Create a corpus" section.
create_document_request = glm.CreateDocumentRequest(parent=corpus_resource_name, document=example_document)
create_document_response = retriever_service_client.create_document(create_document_request)

# Set the `document_resource_name` for subsequent sections.
document_resource_name = create_document_response.name
print(create_document_response)
```

### Chunking

To improve the relevance of content returned by the vector DB during semantic retrieval, documents can be broken down into __chunks__ while ingesting.

A `Chunk` is a subpart of a `Document` that is treated as an independent unit for the purpose of vector representation and storage. It can have a max of 2043 tokens.

Google has its own `HtmlChunker`, but other options include `LangChain` and `LlamaIndex`.
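
To show the idea without any of those libraries, here is a naive paragraph-based chunker. It is a sketch only: `chunk_text` is hypothetical, and it approximates the token limit with the 4-characters-per-token rule of thumb rather than a real tokenizer.

```python
def chunk_text(text, max_tokens=2043, chars_per_token=4):
    """Greedily pack whole paragraphs into chunks under an approximate token budget."""
    max_chars = max_tokens * chars_per_token
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph  # an oversized paragraph is kept whole in this sketch
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(chunk_text(doc, max_tokens=10))  # tiny budget to force a split
```

Dedicated chunkers also handle oversized paragraphs and HTML structure, which this sketch ignores.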

### Querying

Finally, we can query the corpus. We can also set metadata filters to restrict the query to certain chunks, for example, filtering to date ranges or categories.

```python
user_query = "What is the purpose of Project IDX?"
results_count = 5

# Add a metadata filter on chunk-level metadata.
chunk_metadata_filter = glm.MetadataFilter(key='chunk.custom_metadata.tags',
                                           conditions=[glm.Condition(
                                              string_value='Google For Developers',
                                              operation=glm.Condition.Operator.INCLUDES)])

# Make the request
# corpus_resource_name is a variable set in the "Create a corpus" section.
request = glm.QueryCorpusRequest(name=corpus_resource_name,
                                 query=user_query,
                                 results_count=results_count,
                                 metadata_filters=[chunk_metadata_filter])
query_corpus_response = retriever_service_client.query_corpus(request)
print(query_corpus_response)
```

### Attributed Question-Answering

Use `GenerateAnswer` to perform Attributed Question-Answering on your document, corpus, or set of passages.

This provides several advantages over an untuned LLM:

* The underlying model has been trained to return only answers that are grounded on the supplied context
* It identifies attributions, enabling the user to verify the answer
* It estimates `answerable_probability`

AQA is specialized for question-answering. For other use cases, such as summarization, etc., call the general model via `GenerateContent`.

```python
user_query = "What is the purpose of Project IDX?"
answer_style = "ABSTRACTIVE" # Or VERBOSE, EXTRACTIVE
MODEL_NAME = "models/aqa"

# Make the request
# corpus_resource_name is a variable set in the "Create a corpus" section.
content = glm.Content(parts=[glm.Part(text=user_query)])
retriever_config = glm.SemanticRetrieverConfig(source=corpus_resource_name, query=content)
req = glm.GenerateAnswerRequest(model=MODEL_NAME,
                                contents=[content],
                                semantic_retriever=retriever_config,
                                answer_style=answer_style)
aqa_response = generative_service_client.generate_answer(req)
print(aqa_response)
```

### Inline Passages

Alternatively, you can bypass the Semantic Retriever API by using `inline_passages`.

```python
user_query = "What is AQA from Google?"
user_query_content = glm.Content(parts=[glm.Part(text=user_query)])
answer_style = "VERBOSE" # or ABSTRACTIVE, EXTRACTIVE
MODEL_NAME = "models/aqa"

# Create the grounding inline passages
grounding_passages = glm.GroundingPassages()
passage_a = glm.Content(parts=[glm.Part(text="Attributed Question and Answering (AQA) refers to answering questions grounded to a given corpus and providing citation")])
grounding_passages.passages.append(glm.GroundingPassage(content=passage_a, id="001"))
passage_b = glm.Content(parts=[glm.Part(text="An LLM is not designed to generate content grounded in a set of passages. Although instructing an LLM to answer questions only based on a set of passages reduces hallucination, hallucination still often occurs when LLMs generate responses unsupported by facts provided by passages")])
grounding_passages.passages.append(glm.GroundingPassage(content=passage_b, id="002"))
passage_c = glm.Content(parts=[glm.Part(text="Hallucination is one of the biggest problems in Large Language Models (LLM) development. Large Language Models (LLMs) could produce responses that are fictitious and incorrect, which significantly impacts the usefulness and trustworthiness of applications built with language models.")])
grounding_passages.passages.append(glm.GroundingPassage(content=passage_c, id="003"))

# Create the request
req = glm.GenerateAnswerRequest(model=MODEL_NAME,
                                contents=[user_query_content],
                                inline_passages=grounding_passages,
                                answer_style=answer_style)
aqa_response = generative_service_client.generate_answer(req)
print(aqa_response)
```

## Generative vs. Deterministic

Q. Are generative models deterministic or random?

A. Both.

Explanation:

When you prompt the model, the text response is generated in two stages. In the first stage, the model processes the input prompt and produces a __probability distribution__ over possible tokens that are likely to come next.


E.g., if you enter, "the dog jumped over the..."

`[("fence", 0.77), ("ledge", 0.12), ("blanket", 0.03), ...]`

This process is deterministic, and the model will produce the same distribution every time.

In the second stage, the generative model converts these distributions into actual text responses through one of several decoding strategies.

Parameters such as temperature control how the next token is picked from the distribution, and so how much randomness enters at this stage.
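
The two stages can be imitated with the toy distribution from the example. This is a sketch of generic decoding strategies, not Gemini's internals; the function names are made up, and the distribution is truncated so its weights do not sum to 1.

```python
import math
import random

def apply_temperature(probs, temperature):
    """Rescale to p_i^(1/T) and renormalize (equivalent to dividing logits by T)."""
    scaled = {tok: math.exp(math.log(p) / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}

def greedy(probs):
    """Deterministic decoding: always pick the most likely token."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Random decoding: draw a token in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

probs = {"fence": 0.77, "ledge": 0.12, "blanket": 0.03}

print(greedy(probs))                   # same answer every time
print(apply_temperature(probs, 0.5))   # low temperature sharpens the distribution
print(sample(apply_temperature(probs, 2.0), random.Random(0)))  # high temperature flattens it
```

Greedy decoding makes the whole pipeline deterministic, while sampling is where run-to-run variation comes from.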

## More Use Cases

* Vision: Get bounding boxes
* Vision: Transcribe video and get visual descriptions
* Audio: Get a transcript
* Text: Get structured JSON from unstructured data.

## Code Execution

Code execution is a tool that can be made available to the model.

```python
model = genai.GenerativeModel(
    model_name='gemini-1.5-pro',
    tools='code_execution')
```

The model will then decide when to use it. For example:

```python
response = model.generate_content((
    'What is the sum of the first 50 prime numbers? '
    'Generate and run code for the calculation, and make sure you get all 50.'))

print(response.text)
```

```python
def is_prime(n):
  """Checks if a number is prime."""
  if n <= 1:
    return False
  for i in range(2, int(n**0.5) + 1):
    if n % i == 0:
      return False
  return True

def sum_of_primes(n):
  """Calculates the sum of the first n prime numbers."""
  primes = []
  i = 2
  while len(primes) < n:
    if is_prime(i):
      primes.append(i)
    i += 1
  return sum(primes)

# Calculate the sum of the first 50 prime numbers
sum_of_first_50_primes = sum_of_primes(50)

print(f"The sum of the first 50 prime numbers is: {sum_of_first_50_primes}")

```

Alternatively, you can enable code execution in the prompt:

```python
response = model.generate_content(
    ('What is the sum of the first 50 prime numbers? '
    'Generate and run code for the calculation, and make sure you get all 50.'),
    tools='code_execution')
```

The model may look at any media files you pass in to the prompt, but it won't use them as part of the code.

Code execution and function calling are similar features:

* Code execution lets the model run code in the API backend in a fixed, isolated environment.

* Function calling lets you run the functions that the model requests, in whatever environment you want.

In general you should prefer to use code execution if it can handle your use case. Code execution is simpler to use (you just enable it) and resolves in a single GenerateContent request (thus incurring a single charge).

Function calling takes an additional GenerateContent request to send back the output from each function call (thus incurring multiple charges).

For most cases, you should use function calling if you have your own functions that you want to run locally, and you should use code execution if you'd like the API to write and run Python code for you and return the result.

## Function Calling

Custom functions can be provided to Gemini models. 

The models do not directly invoke these functions, but generate structured data output that specifies the name and suggested args.

Function calling lets you interact with real-time information and services, such as DBs, CRMs, document repos and so on.

You use the Function Calling feature by adding structured query data describing programming interfaces, called function declarations, to a model prompt.

The function declarations provide the name of the API function, explain its purpose, any parameters it supports, and descriptions of those parameters.

After you pass a list of function declarations in a query to the model, it analyzes function declarations and the rest of the query to determine how to use the declared API in response to the request.

The model then returns an object in an OpenAPI compatible schema specifying how to call one or more of the declared functions in order to respond to the user's question.

You can then take the recommended function call parameters, call the actual API, get a response, and provide that response to the user or take further action. 

It could also be used to extract structured data from text.

Note: This whole section is a bit confusing, but basically what I think it does is allow you to specify some functions, and then the API returns suggested calls to those functions when it thinks that is appropriate.
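
The "suggested call" part can be sketched as a small dispatcher: the model returns a function name plus arguments, and your own code decides whether and how to run it. Everything below is hypothetical, including `get_exchange_rate`, its placeholder rate, and the dict that stands in for the SDK's function-call object.

```python
def get_exchange_rate(base: str, target: str) -> float:
    """Hypothetical local function; a real one might query a currency service."""
    rates = {("USD", "EUR"): 0.5}  # made-up placeholder value
    return rates[(base, target)]

# Registry of the functions that were declared to the model
FUNCTIONS = {"get_exchange_rate": get_exchange_rate}

def dispatch(function_call):
    """Run the function the model asked for, with the arguments it suggested."""
    name = function_call["name"]
    if name not in FUNCTIONS:
        raise ValueError(f"Model requested an undeclared function: {name}")
    return FUNCTIONS[name](**function_call["args"])

# Stand-in for the structured output the model would return
suggested = {"name": "get_exchange_rate", "args": {"base": "USD", "target": "EUR"}}
print(dispatch(suggested))
```

The result would then be sent back to the model in a follow-up request, which is the extra round trip mentioned in the code execution comparison.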

## JSON Mode

You can use JSON mode to...

* Build a DB of companies by pulling company information from newspaper articles
* Pull standardized info from resumes
* Extract ingredients from recipes and display a link to the grocery store website for each ingredient

You can force Gemini Pro to always respond with an expected structure by passing a JSON schema into `response_schema`.

For this to work, you define a class which represents the schema you want to return.

```python
import os
import typing

import google.generativeai as genai

class Recipe(typing.TypedDict):
  recipe_name: str

genai.configure(api_key=os.environ["API_KEY"])

# Using `response_schema` requires one of the Gemini 1.5 Pro models
model = genai.GenerativeModel('gemini-1.5-pro',
                              # Set the `response_mime_type` to output JSON
                              # Pass the schema object to the `response_schema` field
                              generation_config={"response_mime_type": "application/json",
                                                 "response_schema": list[Recipe]})

prompt = "List 5 popular cookie recipes"

response = model.generate_content(prompt)
print(response.text)
```

Note, the model may sometimes be able to respond with JSON without `response_schema`, if you describe the schema in the prompt. E.g.:

```python
prompt = """
  List 5 popular cookie recipes.
  Using this JSON schema:
    Recipe = {"recipe_name": str}
  Return a `list[Recipe]`
  """

```
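
Either way, it is worth checking the returned text before using it, especially when the schema is only described in the prompt. A sketch with a hypothetical response string; `parse_recipes` is not part of the SDK.

```python
import json

def parse_recipes(raw_text):
    """Parse a JSON list of recipes and check each entry has a string `recipe_name`."""
    data = json.loads(raw_text)
    if not isinstance(data, list):
        raise ValueError("Expected a JSON list")
    for item in data:
        if not isinstance(item, dict) or not isinstance(item.get("recipe_name"), str):
            raise ValueError(f"Unexpected recipe entry: {item!r}")
    return [item["recipe_name"] for item in data]

# Hypothetical response text; a real one would come from response.text
raw = '[{"recipe_name": "Chocolate Chip"}, {"recipe_name": "Oatmeal Raisin"}]'
print(parse_recipes(raw))
```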

## Fine Tuning

If prompts don't produce the results you want, __fine tuning__ can improve performance on a specific task, or help the model adhere to specific output requirements when you have a set of examples that show the output you want.

Fine tuning works by providing the model with a training dataset that contains many examples of the task. 

Training data should be structured as examples with prompt inputs and expected response outputs.

When you run a tuning job, the model learns additional parameters that help it encode the necessary information to perform the wanted task or learn the wanted behavior.

The output of the tuning job is a new model, which is effectively a combination of the newly learned parameters, and the original model.

You should target between 100 and 500 examples, although as few as 20 may work. Examples should be structured like real data. E.g., have `text_input` and `output` fields.

Technically, this basically works like tuning any ML model: you set up the `epoch`s, `batch size`, `learning rate` etc.
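
A sketch of what such a dataset might look like, with a sanity check before submitting a tuning job. The toy "next number" task and `validate_training_data` are made up for illustration; only the `text_input` / `output` field names come from the notes above.

```python
def validate_training_data(examples, min_examples=20):
    """Return a list of problems; an empty list means the data looks usable."""
    problems = []
    if len(examples) < min_examples:
        problems.append(f"only {len(examples)} examples; fewer than {min_examples}")
    for i, example in enumerate(examples):
        for key in ("text_input", "output"):
            value = example.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"example {i}: missing or empty '{key}'")
    return problems

# Toy task: map a number to the next one
training_data = [{"text_input": str(n), "output": str(n + 1)} for n in range(1, 21)]
print(validate_training_data(training_data))
```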