## Gen AI Intensive Course Capstone 2025Q1

Hi there!

I'm proud to present my submission for the Gen AI Intensive Course Capstone Project. This has been a very exciting and interesting experience — my first-ever course in AI, and a fantastic introduction to the field!

For my project, I built an AI assistant that answers Pokémon-related questions using Retrieval-Augmented Generation (RAG) and the Gemini 2.0 Flash model. RAG was one of my favorite topics from the course, so I decided to combine it with something else I really enjoy: Pokémon!

The dataset I used was sourced from Kaggle’s NLP section and serves as the foundation for my vector database, providing the relevant information the model uses to generate accurate answers.

Enough said — let’s dive into the project!

## Setup

First, install ChromaDB and the Gemini API Python SDK.

In [12]:
!pip uninstall -qqy jupyterlab kfp 2>/dev/null  # Remove unused conflicting packages
!pip install -U -q "google-genai==1.7.0" "chromadb==0.6.3"


In [13]:
from google import genai
from google.genai import types

genai.__version__

'0.2.2'

## Set up API key

The Kaggle Secrets add-on lets me store my API key from AI Studio and read it from the notebook.

In [14]:
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")

client = genai.Client(api_key=GOOGLE_API_KEY)

## Import and Load data

In [15]:
path = '/kaggle/input/pokdex-for-all-1025-pokemon-w-text-description/pokedex.csv'

In [16]:
import pandas as pd

db = pd.read_csv(path)
db.head()

Unnamed: 0,id,name,height,weight,hp,attack,defense,s_attack,s_defense,speed,type,evo_set,info
0,1,bulbasaur,7,69,45,49,49,65,65,45,"{grass,poison}",1,A strange seed was planted on its back at birt...
1,2,ivysaur,10,130,60,62,63,80,80,60,"{grass,poison}",1,"When the bulb on its back grows large, it appe..."
2,3,venusaur,20,1000,80,82,83,100,100,80,"{grass,poison}",1,The plant blooms when it is absorbing solar en...
3,4,charmander,6,85,39,52,43,60,50,65,{fire},2,"Obviously prefers hot places. When it rains, s..."
4,5,charmeleon,11,190,58,64,58,80,65,80,{fire},2,"When it swings its burning tail, it elevates t..."


## Feature Engineering
Inspect how the data is structured, then add or drop columns where doing so makes the data better organized or more useful.

In [17]:
db.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         1025 non-null   int64 
 1   name       1025 non-null   object
 2   height     1025 non-null   int64 
 3   weight     1025 non-null   int64 
 4   hp         1025 non-null   int64 
 5   attack     1025 non-null   int64 
 6   defense    1025 non-null   int64 
 7   s_attack   1025 non-null   int64 
 8   s_defense  1025 non-null   int64 
 9   speed      1025 non-null   int64 
 10  type       1025 non-null   object
 11  evo_set    1025 non-null   int64 
 12  info       1025 non-null   object
dtypes: int64(10), object(3)
memory usage: 104.2+ KB


In [18]:
db["type"].value_counts() #Check if the data contains double typing

type
{normal}          75
{water}           74
{grass}           47
{psychic}         39
{fire}            36
                  ..
{flying,water}     1
{poison,bug}       1
{dark,normal}      1
{ghost,dark}       1
{poison,ghost}     1
Name: count, Length: 203, dtype: int64

Since this data will supply context to an AI model, I will add a Base Stat Total (BST) feature that sums each Pokémon's six base stats into a single overall value. If the AI is hit with questions like "What is the most powerful Pokémon based on stats?", looking at the BST is much easier than taking all the individual stats into consideration.

In [19]:
# Add a Base Stat Total (BST) to every Pokémon by summing its six base stats.
# Conveniently, the stat columns (hp through speed) sit next to each other
# at positions 4-9, so each row's slice can be summed directly.
for row in range(len(db)):
    db.loc[row, 'bst'] = db.iloc[row, 4:10].sum()

db["bst"].head()


0    318.0
1    405.0
2    525.0
3    309.0
4    405.0
Name: bst, dtype: float64
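
The same total can also be computed without an explicit loop. The following is a minimal sketch, assuming the six stat columns carry the names shown in `db.info()`. The tiny `sample` frame is copied from the first rows of the `db.head()` output purely for illustration.

```python
import pandas as pd

# Hypothetical three-row sample copied from the db.head() output above.
sample = pd.DataFrame({
    "name": ["bulbasaur", "ivysaur", "charmander"],
    "hp": [45, 60, 39],
    "attack": [49, 62, 52],
    "defense": [49, 63, 43],
    "s_attack": [65, 80, 60],
    "s_defense": [65, 80, 50],
    "speed": [45, 60, 65],
})

STAT_COLUMNS = ["hp", "attack", "defense", "s_attack", "s_defense", "speed"]

# Sum across the stat columns for every row at once (axis=1 means "per row").
sample["bst"] = sample[STAT_COLUMNS].sum(axis=1)

print(sample[["name", "bst"]])
```

Selecting the columns by name avoids relying on their position in the frame. Note that this version yields an integer column, whereas the loop above produces floats.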

In [20]:
db.tail() # Inspect the data

Unnamed: 0,id,name,height,weight,hp,attack,defense,s_attack,s_defense,speed,type,evo_set,info,bst
1020,1021,raging-bolt,52,4800,125,73,91,137,89,75,"{electric,dragon}",545,It's said to incinerate everything around it w...,590.0
1021,1022,iron-boulder,15,1625,90,120,80,68,108,124,"{rock,psychic}",547,It resembles a Pokémon described in a dubious ...,590.0
1022,1023,iron-crown,16,1560,90,72,100,122,108,98,"{steel,psychic}",546,It resembles a mysterious object introduced in...,590.0
1023,1024,terapagos,2,65,90,65,85,65,85,60,{normal},548,Terapagos protects itself using its power to t...,450.0
1024,1025,pecharunt,3,3,88,88,160,88,88,88,"{poison,ghost}",549,It feeds others toxic mochi that draw out desi...,600.0


In [21]:
db.info() # last look at the data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   id         1025 non-null   int64  
 1   name       1025 non-null   object 
 2   height     1025 non-null   int64  
 3   weight     1025 non-null   int64  
 4   hp         1025 non-null   int64  
 5   attack     1025 non-null   int64  
 6   defense    1025 non-null   int64  
 7   s_attack   1025 non-null   int64  
 8   s_defense  1025 non-null   int64  
 9   speed      1025 non-null   int64  
 10  type       1025 non-null   object 
 11  evo_set    1025 non-null   int64  
 12  info       1025 non-null   object 
 13  bst        1025 non-null   float64
dtypes: float64(1), int64(10), object(3)
memory usage: 112.2+ KB


## Embed the data
In this step we define an embedding function class that Chroma uses to embed the data as it is added to the database. We use the same approach and the same embedding model as in the course lessons.

In [22]:
from chromadb import Documents, EmbeddingFunction, Embeddings
from google.api_core import retry

from google.genai import types


# Define a helper to retry when per-minute quota is reached.
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})


class GeminiEmbeddingFunction(EmbeddingFunction):
    # Specify whether to generate embeddings for documents, or queries
    document_mode = True

    @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(
                task_type=embedding_task,
            ),
        )
        return [e.values for e in response.embeddings]

Originally, I embedded and added each data point to the Chroma database individually, which took approximately 22 minutes in total. To speed this up, I switched to batch processing with a default batch size of 100. This cut the upload time to about 22 seconds, roughly a 98% reduction (about 60 times faster), and should scale better to larger datasets.

In [23]:
import chromadb 
import time     
import json      

# Define the name of the ChromaDB collection
DB_NAME = "pokedex"

# Instantiate the embedding function using Gemini
embed_fn = GeminiEmbeddingFunction()
embed_fn.document_mode = True  # Enable document embedding mode

# Create a ChromaDB client and get or create a collection
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(name=DB_NAME, embedding_function=embed_fn)


def add_data_to_chroma(df, batch_size=100):
    """
    Adds data to the ChromaDB collection in batches.

    Parameters:
    - df (pd.DataFrame): The DataFrame containing Pokémon data.
    - batch_size (int): Number of rows to process in each batch (default is 100).
    """

    # Iterate through the DataFrame in chunks of `batch_size`
    for i in range(0, len(df), batch_size):
        # Select the current batch and create a copy
        batch = df.iloc[i:i+batch_size].copy()

        # Rename specific columns to match expected metadata field names
        batch = batch.rename(columns={
            "s_attack": "special attack",
            "s_defense": "special defense",
            "evo_set": "evo set"
        })

        # Define the metadata fields to extract for each record
        metadata_fields = [
            "name", "type", "bst", "height", "weight", 
            "hp", "attack", "defense", "special attack", 
            "special defense", "speed", "evo set"
        ]

        # Convert numpy types (e.g., int64) to native Python types for JSON compatibility.
        # DataFrame.map replaces the deprecated applymap (pandas >= 2.1).
        metadata = batch[metadata_fields].map(
            lambda x: x.item() if hasattr(x, "item") else x
        ).to_dict("records")

        try:
            # Add the batch to the ChromaDB collection
            collection.add(
                ids=batch["id"].astype(str).tolist(),      # Unique IDs for each record
                documents=batch["info"].tolist(),          # Document text for embedding
                metadatas=metadata                         # Metadata for filtering/search
            )
            print(f"Added batch {i//batch_size + 1}")
            time.sleep(1)  # Optional delay to avoid rate limits or overload
        except Exception as e:
            # Handle and print errors if the batch fails
            print(f"Error adding batch {i//batch_size + 1}: {str(e)}")


if isinstance(db, pd.DataFrame):
    start_time = time.perf_counter()  # Start timer
    add_data_to_chroma(db)            # Upload data to Chroma
    end_time = time.perf_counter()    # End timer
    print(f"Execution took {end_time - start_time:.6f} seconds")
else:
    print("Error: `db` must be a Pandas DataFrame")


Added batch 1
Added batch 2
Added batch 3
Added batch 4
Added batch 5
Added batch 6
Added batch 7
Added batch 8
Added batch 9
Added batch 10
Added batch 11
Execution took 23.103200 seconds
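
The numbers in this log can be sanity-checked with a little arithmetic. The sketch below only uses figures quoted in this notebook (1,025 rows, a batch size of 100, roughly 22 minutes before batching and 23.1 seconds after). The variable names are my own.

```python
import math

rows = 1025            # collection size reported by db.info()
batch_size = 100       # default used by add_data_to_chroma
before_s = 22 * 60     # approximate one-by-one upload time, in seconds
after_s = 23.1032      # batched upload time from the log above

batches = math.ceil(rows / batch_size)   # the last batch holds the 25 leftover rows
speedup = before_s / after_s             # how many times faster
reduction = 1 - after_s / before_s       # fraction of time saved

print(f"{batches} batches, {speedup:.0f}x faster, {reduction:.1%} less time")
```

The batch count matches the final "Added batch 11" line in the log.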


In [24]:
collection.count() #Make sure all the rows got uploaded

1025

In [25]:
# Check the first few records to confirm that the metadata was processed correctly
sample = collection.get(limit=3, include=["metadatas"])
for i, meta in enumerate(sample["metadatas"]):
    print(f"Record {i} metadata:", meta)

Record 0 metadata: {'attack': 49, 'bst': 318.0, 'defense': 49, 'evo set': 1, 'height': 7, 'hp': 45, 'name': 'bulbasaur', 'special attack': 65, 'special defense': 65, 'speed': 45, 'type': '{grass,poison}', 'weight': 69}
Record 1 metadata: {'attack': 62, 'bst': 405.0, 'defense': 63, 'evo set': 1, 'height': 10, 'hp': 60, 'name': 'ivysaur', 'special attack': 80, 'special defense': 80, 'speed': 60, 'type': '{grass,poison}', 'weight': 130}
Record 2 metadata: {'attack': 82, 'bst': 525.0, 'defense': 83, 'evo set': 1, 'height': 20, 'hp': 80, 'name': 'venusaur', 'special attack': 100, 'special defense': 100, 'speed': 80, 'type': '{grass,poison}', 'weight': 1000}


In [26]:
collection.peek()

{'ids': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
 'embeddings': array([[-0.04983467,  0.04766912, -0.03131578, ..., -0.00996424,
         -0.01859296, -0.00747153],
        [-0.00581232,  0.02732378, -0.02708765, ...,  0.01603996,
         -0.00270984, -0.06517149],
        [-0.0035329 ,  0.03909785, -0.03910772, ...,  0.01979461,
          0.01176705, -0.06416678],
        ...,
        [-0.00910684,  0.00555849, -0.04753857, ..., -0.02807259,
         -0.03442959, -0.02871817],
        [-0.03576144, -0.00292758, -0.0250475 , ..., -0.0051216 ,
          0.00544243, -0.01488772],
        [-0.03546435,  0.01431307, -0.05014285, ...,  0.0009232 ,
         -0.00165167, -0.03995217]]),
 'documents': ['A strange seed was planted on its back at birth. The plant sprouts and grows with this POKéMON.',
  'When the bulb on its back grows large, it appears to lose the ability to stand on its hind legs.',
  'The plant blooms when it is absorbing solar energy. It stays on the move to see

## Check the database by Querying manually before implementing the AI
Query the database directly to look for similar entries.

In [27]:
from IPython.display import Markdown

# Switch to query mode when generating embeddings.
embed_fn.document_mode = False
query = "Pikachu"
result = collection.query(query_texts=[query], n_results=1)
[all_passages] = result["documents"]

Markdown(all_passages[0])

It’s considered to be a good Pokémon for beginners because of its friendliness, but its disposition grows rougher as it grows up.
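
Conceptually, a vector query ranks the stored embeddings by how close they are to the query embedding. The following is a toy sketch of that idea with made-up three-dimensional vectors and cosine similarity. Real embeddings from the model have far more dimensions, and Chroma's own index and distance metric may differ from what is shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up "embeddings" standing in for three stored descriptions.
stored = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.1],
    "doc_c": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # made-up query embedding

# Rank documents from most to least similar to the query.
ranked = sorted(
    stored,
    key=lambda key: cosine_similarity(query_vector, stored[key]),
    reverse=True,
)
print(ranked[0])
```

Because only the description text is embedded in this notebook, a query that is just a name is matched against descriptions rather than against the name metadata.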

## Implementation of the Gemini AI
In this cell we implement several helper functions:

* flatten: A helper that checks for nested lists and flattens them. It was needed because the zip call used to build the context for the AI ran into problems with nested lists.
* extract_pokemon_names: Identifies Pokémon names in the query. It finds capitalized words as potential names and checks whether each one is present in the "name" metadata field. Any names found are returned for later use.
* get_query_embedding: Embeds the user query into a vector representation using the Gemini embedding function. This embedding is used for similarity search in the vector database.
* search_pokemon: Stores the names from extract_pokemon_names in a variable called "mentioned_names" and runs a vector search to get "base_results". If "mentioned_names" is empty, the function returns the base results. Otherwise it fetches the data for those Pokémon from the database, combines it with the base results, and returns the combination.
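
The nesting problem that flatten works around comes from the two result shapes involved: `collection.query` returns one inner list per query, while `collection.get` returns flat lists. Below is a minimal sketch with plain dictionaries standing in for Chroma results. The IDs and texts are illustrative, and `merge_results` is a hypothetical helper that also drops duplicate IDs, which the notebook's own code does not do.

```python
def flatten(lst):
    """Flatten one level of nesting, leaving non-list items untouched."""
    flat = []
    for item in lst:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat

# Stand-ins for the two result shapes (values are illustrative only).
query_like = {"ids": [["94", "25"]], "documents": [["gengar text", "pikachu text"]]}
get_like = {"ids": ["94"], "documents": ["gengar text"]}

def merge_results(first, second):
    """Hypothetical helper: combine two results and keep each ID once."""
    merged = {"ids": [], "documents": []}
    seen = set()
    ids = flatten(first["ids"]) + flatten(second["ids"])
    docs = flatten(first["documents"]) + flatten(second["documents"])
    for doc_id, doc in zip(ids, docs):
        if doc_id not in seen:
            seen.add(doc_id)
            merged["ids"].append(doc_id)
            merged["documents"].append(doc)
    return merged

combined = merge_results(query_like, get_like)
print(combined["ids"])
```

Dropping duplicates is a design choice: it keeps a Pokémon that is both named in the query and returned by the vector search from appearing twice in the context.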


In [28]:
import re

def flatten(lst):
    flat = []
    for item in lst:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat

def extract_pokemon_names(query):
    # Match capitalized words as candidate Pokémon names
    potential_names = re.findall(r'\b[A-Z][a-z]+\b', query)
    if not potential_names:
        return []  # Always return a list, even when nothing matches

    # Names are stored in lowercase, so lowercase the candidates before the lookup
    potential_names_lower = [name.lower() for name in potential_names]

    # Check which candidates exist in the database
    existing_records = collection.get(
        where={"name": {"$in": potential_names_lower}},
        include=["metadatas"],
        limit=10
    )

    # Get unique validated names
    valid_names = {meta["name"] for meta in existing_records["metadatas"]}

    return [name for name in potential_names_lower if name in valid_names]

    
def get_query_embedding(query):
    embed_fn.document_mode = False
    response = embed_fn([query])
    return response[0]

def search_pokemon(query, collection, top_n):
    # Extract mentioned Pokémon names
    mentioned_names = extract_pokemon_names(query)
    
    # Base vector search
    base_results = collection.query(
        query_embeddings=[get_query_embedding(query)],
        n_results=top_n
    )
    
    # If names mentioned, boost those results
    if mentioned_names:
        name_results = collection.get(
            where={"name": {"$in": mentioned_names}},
            include=["metadatas", "documents"]
        )
        
        combined = {
            "ids": flatten(base_results["ids"]) + flatten(name_results["ids"]),
            "metadatas": flatten(base_results["metadatas"]) + flatten(name_results["metadatas"]),
            "documents": flatten(base_results["documents"]) + flatten(name_results["documents"]),
        }

        return combined  # Return combined results
    
    return base_results
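
The name-matching step can be tried in isolation. The sketch below swaps the Chroma lookup for a plain Python set so that it runs anywhere; `known_names` is a made-up stand-in for the "name" metadata and `find_known_names` is a hypothetical helper. It also shows one limitation of the capitalized-word pattern: names typed in lowercase are not detected.

```python
import re

# Made-up stand-in for the names stored in the collection's metadata.
known_names = {"gengar", "giratina", "pikachu"}

def find_known_names(query, names):
    """Return lowercase names from `names` that appear capitalized in `query`."""
    candidates = re.findall(r"\b[A-Z][a-z]+\b", query)
    return [word.lower() for word in candidates if word.lower() in names]

print(find_known_names("What is the special defense of Gengar?", known_names))
print(find_known_names("Compare Pikachu and Giratina", known_names))
print(find_known_names("tell me about pikachu", known_names))
```

A case-insensitive match against the known names would catch the third query, at the cost of checking every word in the question.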

In [29]:
from pprint import pprint
def ask_gemini(query, collection):
    # Get relevant Pokémon from ChromaDB
    query = query.replace("\n", " ")
    results = search_pokemon(query, collection, top_n=3)

    if not results["documents"]:
        # Return the same (answer, context) shape as the normal path
        return "I couldn't find relevant information in the database.", ""

    # Although search_pokemon flattens combined results, the plain vector-search
    # result is still nested, so make sure no nested list is passed to zip below
    metadatas = results["metadatas"]
    documents = results["documents"]

    # If nested (list of lists), flatten it
    if isinstance(metadatas[0], list):
        metadatas = metadatas[0]
    if isinstance(documents[0], list):
        documents = documents[0]


    # Format relevant Pokémon data as context for Gemini
    context = "\n\n".join([
        f"Name: {meta['name']}\n"
        f"Type: {meta['type']}\n"
        f"Description: {doc}\n"
        f"Height: {meta['height']}\n"
        f"Weight: {meta['weight']}\n"
        f"HP: {meta['hp']}\n"
        f"Attack: {meta['attack']}\n"
        f"Defense: {meta['defense']}\n"
        f"Special Attack: {meta['special attack']}\n"
        f"Special Defense: {meta['special defense']}\n"
        f"Speed: {meta['speed']}\n"
        f"Evo Set: {meta['evo set']}\n"
        f"BST: {meta['bst']}"
        for meta, doc in zip(metadatas, documents)
    ])
    
    # Use Gemini to generate an answer
    prompt = f"""**Pokémon Knowledge Assistant**
    You are an expert Pokémon analyst with access to verified data.
    
    **Available Data:**
    {context}
    Weight is measured in **hectograms**, convert it to **kilograms** when answering questions about weight
    Height is measured in **decimeters**, convert to **meters** when answering questions about height
    
    **Task:**
    1. Answer this question specifically: "{query}"
    2. Only use provided data when possible
    3. For stat comparisons, calculate exact differences
    4. For evolution questions, reference the evo set
    5. If unsure, say "According to my data..." 
    
    **Response Requirements:**
    - Be precise with numbers when available
    - Mention the source Pokémon name
    - Keep answers under 3 sentences unless complex
    - Stay in scope, when asked something outside the source do not answer the question
    - Answer politely: say hi and say you hope your answer helps the user. If the question is out of scope, apologize and explain instead.
    - Be creative!!!
    
    Question: {query}
    """
    response = client.models.generate_content(
        model = "gemini-2.0-flash",
        contents = prompt)
    return response.text, context
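
The prompt asks the model to convert units itself. The arithmetic is simple enough to do in code before building the context, which is one possible alternative. The sketch below is not part of the notebook's pipeline, and the helper names are hypothetical.

```python
def hectograms_to_kilograms(weight_hg):
    """Convert hectograms to kilograms (1 kilogram = 10 hectograms)."""
    return weight_hg / 10

def decimeters_to_meters(height_dm):
    """Convert decimeters to meters (1 meter = 10 decimeters)."""
    return height_dm / 10

# Values taken from the dataset rows shown earlier (bulbasaur: height 7, weight 69).
print(hectograms_to_kilograms(69), "kg")
print(decimeters_to_meters(7), "m")
```

Converting up front would remove one source of model error, while leaving the conversion in the prompt keeps the stored metadata identical to the source dataset.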

In [30]:
# Each question exercises a different path through the assistant
questions = [
    "Who is the lightest pokemon?",                                    # No specific Pokémon
    "What is the special defense and special attack stat of Gengar?",  # Specific data of a specific Pokémon
    "Give me information on a Pokemon called Giratina.",               # Specific data of a different Pokémon
    "Tell me about a Nissan?",                                         # Out-of-scope question
]

for question in questions:
    answer = ask_gemini(question, collection)  # (answer text, context) tuple
    print("\n--------------------------------------------\n")
    print(f"""
Question: {question}\n
Answer: \n{answer[0]}
\nContext used to answer: \n{answer[1]}
""")
    print("\n--------------------------------------------\n")


--------------------------------------------


Question: Who is the lightest pokemon?

Answer: 
Hi there! I hope I can help you find your answer!

Flittle is the lightest Pokémon on the list weighing in at 1.5 kilograms. Hawlucha weighs 21.5 kilograms and Litwick weighs 3.1 kilograms.


Context used to answer: 
Name: flittle
Type: {psychic}
Description: Flittle’s toes levitate about half an inch above the ground because of the psychic power emitted from the frills on the Pokémon’s belly.
Height: 2
Weight: 15
HP: 30
Attack: 35
Defense: 30
Special Attack: 55
Special Defense: 30
Speed: 75
Evo Set: 500
BST: 255.0

Name: litwick
Type: {ghost,fire}
Description: Litwick shines a light that absorbs the life energy of people and Pokémon, which becomes the fuel that it burns.
Height: 3
Weight: 31
HP: 50
Attack: 30
Defense: 55
Special Attack: 65
Special Defense: 55
Speed: 20
Evo Set: 308
BST: 275.0

Name: hawlucha
Type: {fighting,flying}
Description: Although its body is small, its proficient fi