# Opportunity Search Agent

An AI agent that extracts scholarships, internships, and fellowships from web pages using Google's ADK and Gemini.

**Goal**: Find and extract structured opportunity data from URLs

**Input**: Direct webpage URLs (not search result pages)

**Output**: JSON array of opportunities with details like title, organization, deadline, eligibility, etc.

## 1. Setup & Configuration

### API Key Setup

In [None]:

# ============================================================================
# API KEY SETUP - SECURITY CRITICAL
# ============================================================================
# 
# ‚ö†Ô∏è IMPORTANT: Never hardcode API keys in production code!
# 
# For Google Colab:
#   1. Click the üîë (key) icon in the left sidebar
#   2. Add a secret named "GOOGLE_API_KEY"
#   3. Paste your Gemini API key as the value
#   4. The code below will securely retrieve it
#
# For local development:
#   - Use environment variables: os.getenv("GOOGLE_API_KEY")
#   - Or use python-dotenv with a .env file (never commit .env!)
#
# Get your API key from: https://makersuite.google.com/app/apikey
# ============================================================================

import os
from kaggle_secrets import UserSecretsClient

try:
    GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
    os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
    print("‚úÖ Gemini API key setup complete.")
except Exception as e:
    print(
        f"üîë Authentication Error: Please make sure you have added 'GOOGLE_API_KEY' to your Kaggle secrets. Details: {e}"
    )

‚úÖ Gemini API key setup complete.


### Install Dependencies

In [4]:
!pip install -q --upgrade google-generativeai pymongo
print("‚úÖ Dependencies installed.")

‚úÖ Dependencies installed.


### Import Libraries

In [5]:

# ============================================================================
# LIBRARY IMPORTS
# ============================================================================
# 
# Standard library imports for async operations, regex, and JSON handling
# Third-party imports for web scraping, data processing, and AI agents
# 
# Key libraries:
#   - google.adk: Agent Development Kit for building AI agents
#   - google.genai: Gemini LLM integration
#   - requests: HTTP library for fetching web content
#   - pandas: Data manipulation and CSV export
#   - pymongo: MongoDB database integration
# ============================================================================

import re
import json
import asyncio
import requests
import pandas as pd
from bs4 import BeautifulSoup

from google.adk.agents import LlmAgent
from google.adk.models.google_llm import Gemini
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.adk.memory import InMemoryMemoryService
from google.genai import types

import pymongo
from pymongo.errors import ConnectionFailure, OperationFailure

print("‚úÖ All libraries imported successfully.")

  from google.cloud.aiplatform.utils import gcs_utils


‚úÖ All libraries imported successfully.


### Configuration Constants

In [6]:

# ============================================================================
# CONFIGURATION CONSTANTS
# ============================================================================
#
# Centralized configuration makes the code maintainable and easy to modify.
#
# Application Settings:
#   - APP_NAME: Identifier for the agent (used in logging and sessions)
#   - USER_ID: Unique user identifier (in production, this would be dynamic)
#
# MongoDB Settings:
#   - Update these with your actual MongoDB connection details
#   - For local: "mongodb_url"
#   - For Atlas: "mongodb+srv://username:password@cluster.mongodb.net/"
#
# Retry Configuration:
#   - Ensures resilience against temporary API failures
#   - Exponential backoff prevents overwhelming the API
#   - Retries on: 429 (rate limit), 500/503/504 (server errors)
# ============================================================================

# Application settings
APP_NAME = "opportunity_search_agent"
USER_ID = "user-123"

# MongoDB settings (update with your connection details)
MONGODB_CONNECTION_STRING = "mongodb_url"
MONGODB_DB_NAME = "db_name"
MONGODB_COLLECTION_NAME = "collection_name"

# Retry configuration for API calls
retry_config = types.HttpRetryOptions(
    attempts=5,
    exp_base=7,
    initial_delay=1,
    http_status_codes=[429, 500, 503, 504]
)

# Initialize services
session_service = InMemorySessionService()
memory_service = InMemoryMemoryService()

print("‚úÖ Configuration complete.")

‚úÖ Configuration complete.


## 2. Tool Definitions

### Web Content Fetcher

In [7]:

# ============================================================================
# TOOL 1: WEB CONTENT FETCHER
# ============================================================================
#
# This tool enables the agent to retrieve web page content for analysis.
#
# Design Decisions:
#   - Async implementation: Prevents blocking the event loop
#   - 10-second timeout: Prevents hanging on slow/unresponsive servers
#   - Graceful error handling: Returns empty string on failure
#   - Uses asyncio.to_thread: Makes synchronous requests.get non-blocking
#
# The agent can invoke this tool when it needs to fetch HTML content
# from a URL to extract opportunity information.
# ============================================================================

async def fetch_web_content(url: str) -> str:
    """Fetches the HTML content of a given URL."""
    try:
        response = await asyncio.to_thread(requests.get, url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching content from {url}: {e}")
        return ""

# Define as ADK tool
fetch_web_content_function_declaration = types.FunctionDeclaration(
    name="fetch_web_content",
    description="Fetches the HTML content of a given URL.",
    parameters=types.Schema(
        type="object",
        properties={
            "url": types.Schema(type="string", description="The URL to fetch.")
        },
        required=["url"],
    ),
)

fetch_web_content_tool_object = types.Tool(
    function_declarations=[fetch_web_content_function_declaration]
)

print("‚úÖ Web content fetcher tool defined.")

‚úÖ Web content fetcher tool defined.


### MongoDB Uploader

In [8]:

# ============================================================================
# TOOL 2: MONGODB UPLOADER
# ============================================================================
#
# This tool enables persistent storage of extracted opportunities.
#
# Benefits:
#   - Long-term tracking and analysis
#   - Querying and filtering capabilities
#   - Integration with other applications
#   - Historical data for trend analysis
#
# Design Decisions:
#   - Synchronous implementation: pymongo is not async
#   - Connection validation: Checks connection before insert
#   - Bulk insert support: Efficient for multiple documents
#   - Comprehensive error handling: Different failure modes handled
# ============================================================================

def upload_to_mongodb(data: list[dict], connection_string: str, db_name: str, collection_name: str) -> str:
    """Uploads a list of dictionaries to a specified MongoDB collection."""
    if not data:
        return "No data provided to upload to MongoDB."

    try:
        client = pymongo.MongoClient(connection_string)
        client.admin.command('ismaster')
        
        db = client[db_name]
        collection = db[collection_name]

        if isinstance(data, list) and data:
            result = collection.insert_many(data)
            return f"Successfully uploaded {len(result.inserted_ids)} documents to MongoDB."
        elif isinstance(data, dict):
            result = collection.insert_one(data)
            return f"Successfully uploaded 1 document with ID {result.inserted_id} to MongoDB."
        else:
            return "No documents to insert. The provided data list is empty."

    except ConnectionFailure as e:
        return f"MongoDB Connection failed: {e}"
    except OperationFailure as e:
        return f"MongoDB Operation failed: {e}"
    except Exception as e:
        return f"An unexpected error occurred during MongoDB upload: {e}"

# Define as ADK tool
upload_to_mongodb_function_declaration = types.FunctionDeclaration(
    name="upload_to_mongodb",
    description="Uploads a list of dictionaries to a specified MongoDB collection.",
    parameters=types.Schema(
        type="object",
        properties={
            "data": types.Schema(
                type="array",
                items=types.Schema(type="object"),
                description="A list of dictionaries representing the documents to upload."
            ),
            "connection_string": types.Schema(
                type="string",
                description="The MongoDB connection string."
            ),
            "db_name": types.Schema(
                type="string",
                description="The name of the MongoDB database."
            ),
            "collection_name": types.Schema(
                type="string",
                description="The name of the MongoDB collection."
            )
        },
        required=["data", "connection_string", "db_name", "collection_name"],
    ),
)

upload_to_mongodb_tool_object = types.Tool(
    function_declarations=[upload_to_mongodb_function_declaration]
)

print("‚úÖ MongoDB uploader tool defined.")

‚úÖ MongoDB uploader tool defined.


## 3. Agent Initialization

In [9]:

# ============================================================================
# CONFIGURATION CONSTANTS
# ============================================================================
#
# Centralized configuration makes the code maintainable and easy to modify.
#
# Application Settings:
#   - APP_NAME: Identifier for the agent (used in logging and sessions)
#   - USER_ID: Unique user identifier (in production, this would be dynamic)
#
# MongoDB Settings:
#   - Update these with your actual MongoDB connection details
#   - For local: "mongodb://localhost:27017/"
#   - For Atlas: "mongodb+srv://username:password@cluster.mongodb.net/"
#
# Retry Configuration:
#   - Ensures resilience against temporary API failures
#   - Exponential backoff prevents overwhelming the API
#   - Retries on: 429 (rate limit), 500/503/504 (server errors)
# ============================================================================

# Initialize Gemini model with tools
gemini_model = Gemini(
    api_key=GOOGLE_API_KEY,
    retry_config=retry_config,
    tools=[fetch_web_content_tool_object, upload_to_mongodb_tool_object]
)

# Create agent
agent = LlmAgent(
    model=gemini_model,
    name=APP_NAME
)

# Create runner
runner = Runner(
    agent=agent,
    app_name=APP_NAME,
    session_service=session_service,
    memory_service=memory_service,
)

print("‚úÖ Agent initialized with web content retrieval and MongoDB upload tools.")

‚úÖ Agent initialized with web content retrieval and MongoDB upload tools.


## 4. Helper Functions

### Parse and Display Opportunities

In [10]:

# ============================================================================
# HELPER FUNCTION: PARSE AND DISPLAY OPPORTUNITIES
# ============================================================================
#
# This function bridges the agent's output and usable data.
#
# Responsibilities:
#   1. Parse JSON response from agent
#   2. Handle markdown code blocks (```json...```)
#   3. Validate and clean data
#   4. Display opportunities in readable format
#   5. Export to CSV for spreadsheet analysis
#   6. Return data for further processing
#
# Error Handling:
#   - Gracefully handles JSON parsing errors
#   - Provides informative error messages
#   - Returns empty list on failure (doesn't crash)
# ============================================================================

async def parse_and_display_opportunities(agent_response_text: str):
    """Parses the agent's response text, displays opportunities, and exports to CSV.
    Returns the list of opportunities if parsing is successful, otherwise an empty list.
    """
    print("\n### Parsing opportunities from agent's response")

    if not agent_response_text:
        print("Agent returned an empty response. Assuming no opportunities found.")
        return []

    try:
        # Clean markdown code blocks if present
        if agent_response_text.startswith('```json') and agent_response_text.endswith('```'):
            agent_response_text = agent_response_text[len('```json'):-len('```')].strip()

        opportunities = json.loads(agent_response_text)
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON from agent response: {e}")
        print(f"Raw response:\n{agent_response_text}")
        return []

    if not opportunities:
        print("No opportunities found in the agent's response.")
        return []

    print(f"Found {len(opportunities)} opportunities:\n")
    for i, opportunity in enumerate(opportunities):
        print(f"--- Opportunity {i + 1} ---")
        print(f"Title: {opportunity.get('title', 'N/A')}")
        print(f"Organization: {opportunity.get('organization', 'N/A')}")
        print(f"Description: {opportunity.get('description', 'N/A')}")
        print(f"Deadline: {opportunity.get('deadline', 'N/A')}")
        print(f"Application Link: {opportunity.get('application_link', 'N/A')}")
        print(f"Eligibility Criteria: {opportunity.get('eligibility_criteria', 'N/A')}")
        print(f"Category: {opportunity.get('category', 'N/A')}")
        print(f"Full Date Application: {opportunity.get('full_date_application', 'N/A')}")
        print(f"Image URL: {opportunity.get('image_url', 'no_image_available')}")
        print("\n")

    # Export to CSV
    try:
        df_opportunities = pd.DataFrame(opportunities)
        csv_filename = "opportunities.csv"
        df_opportunities.to_csv(csv_filename, index=False)
        print(f"‚úÖ Opportunities successfully exported to {csv_filename}.")
    except Exception as e:
        print(f"Error exporting to CSV: {e}")

    return opportunities

print("‚úÖ Parse and display function defined.")

‚úÖ Parse and display function defined.


### Run Session

In [11]:

# ============================================================================
# HELPER FUNCTION: RUN SESSION
# ============================================================================
#
# This is the main orchestration function that:
#   1. Validates URLs (rejects search result pages)
#   2. Fetches web content
#   3. Constructs prompts for the agent
#   4. Streams agent responses
#   5. Returns final extracted data
#
# Design Decisions:
#   - Pre-validates URLs: More efficient than letting agent discover issues
#   - Explicitly fetches content: More control over the process
#   - Streams responses: Provides real-time feedback
#   - Maintains session context: Enables multi-turn conversations
#
# URL Validation:
#   - Rejects search engine result pages (Google, Bing, etc.)
#   - These pages don't contain structured opportunity data
#   - Provides clear error message to user
# ============================================================================

async def run_session(
    runner_instance: Runner, user_queries: list[str] | str, session_id: str = "default"
):
    """Helper function to run queries in a session and display responses, returning the final model response.
    This version includes a pre-check for search result URLs and explicitly fetches web content.
    """
    print(f"\n### Session: {session_id}")

    # Create or retrieve session
    try:
        session = await session_service.create_session(
            app_name=APP_NAME, user_id=USER_ID, session_id=session_id
        )
    except:
        session = await session_service.get_session(
            app_name=APP_NAME, user_id=USER_ID, session_id=session_id
        )

    # Convert single query to list
    if isinstance(user_queries, str):
        user_queries = [user_queries]

    final_response_text = ""

    for query_original in user_queries:
        # Extract URL from the query
        url_match = re.search(r'URL: (https?://[^\s]+)', query_original)
        if not url_match:
            print("No URL found in the query. Cannot proceed.")
            return "No URL found in the query."
        extracted_url = url_match.group(1)

        # Check if it's a search results page
        search_patterns = [
            'google.com/search',
            'bing.com/search',
            'duckduckgo.com/?q=',
            'yahoo.com/search',
            'ask.com/web'
        ]
        is_search_page = any(pattern in extracted_url for pattern in search_patterns)

        if is_search_page:
            print(f"\nUser > {query_original}")
            error_message = "I cannot extract opportunities from a search results page. Please provide a direct link to a webpage that lists scholarships, internships, or fellowships, not a search query page."
            print(f"Model: > {error_message}")
            final_response_text = error_message
        else:
            # Fetch content and process
            print(f"\nUser (Fetching content for {extracted_url})...")
            html_content = await fetch_web_content(extracted_url)

            if not html_content:
                print(f"Model: > Failed to retrieve content from {extracted_url}.")
                final_response_text = "[]"
            else:
                # Construct prompt for agent
                simplified_prompt = f"""Analyze the following HTML content to identify any scholarships, internships, or fellowships. For each opportunity found, extract the following key details:
- **title**: The name of the scholarship, internship, or fellowship.
- **organization**: The entity offering the opportunity.
- **description**: A brief overview of the opportunity.
- **deadline**: The application deadline (if specified).
- **application_link**: A direct link to the application page (if available).
- **eligibility_criteria**: Key requirements or qualifications for applicants.
- **category**: The type of opportunity (e.g., 'internship', 'fellowship', 'scholarship' including 'Undergraduate', 'Masters', 'PhD' if specified).
- **full_date_application**: The full application deadline date (e.g., 'YYYY-MM-DD').
- **image_url**: The URL of an image associated with the opportunity, or a default string like 'no_image_available' if not found.

Return the extracted information as a JSON array of objects. If no opportunities are found, return an empty JSON array: [].

Your final output must be ONLY the JSON array (or empty JSON array), and should NOT contain any other text, explanations, or Python code snippets. This is extremely important for successful parsing.

Here is the HTML content to analyze:\n\n```html\n{html_content}\n```
"""

                print(f"\nUser > (Agent processing HTML content...)")
                query_content = types.Content(role="user", parts=[types.Part(text=simplified_prompt)])

                async for event in runner_instance.run_async(
                    user_id=USER_ID, session_id=session.id, new_message=query_content
                ):
                    if event.is_final_response() and event.content and event.content.parts:
                        text = event.content.parts[0].text
                        if text and text != "None":
                            print(f"Model: > {text}")
                            final_response_text = text

    return final_response_text

print("‚úÖ Run session function defined.")

‚úÖ Run session function defined.


## 5. Execution & Testing

### Extract Opportunities from URL

In [12]:

# ============================================================================
# EXECUTION & TESTING
# ============================================================================
#
# This section demonstrates how to use the agent to extract opportunities.
#
# Process:
#   1. Provide a URL containing scholarships/internships/fellowships
#   2. Agent fetches and analyzes the HTML content
#   3. Extracts structured data (title, organization, deadline, etc.)
#   4. Returns JSON array of opportunities
#   5. Exports to CSV and optionally uploads to MongoDB
#
# Expected Output:
#   - Console display of found opportunities
#   - opportunities.csv file created
#   - Data uploaded to MongoDB (if configured)
# ============================================================================

# Set the URL to scrape
sample_url = "https://scholarshipscorner.website/"
user_query = f"Please find opportunities from the following URL: URL: {sample_url}"

# Run the agent
agent_final_response = await run_session(runner, user_query, session_id="opportunity_extraction")

# Display the final response
print("\n--- Agent's Final Response ---")
print(agent_final_response)

# Parse and display the results
extracted_opportunities = await parse_and_display_opportunities(agent_final_response)


### Session: opportunity_extraction

User (Fetching content for https://scholarshipscorner.website/)...

User > (Agent processing HTML content...)
Model: > ```json
[
  {
    "title": "Youth Future Summit 2026 in Geneva, Switzerland | YFS 2026",
    "organization": "Scholarships Corner",
    "description": "A global opportunity offering scholarships, fellowships, or internships. Refer to the application link for full details.",
    "deadline": "November 20, 2025",
    "application_link": "https://scholarshipscorner.website/youth-future-summit-switzerland/",
    "eligibility_criteria": "Refer to application link for details.",
    "category": [
      "Youth Forum",
      "Opportunity",
      "Conference"
    ],
    "full_date_application": "2025-11-20",
    "image_url": "https://scholarshipscorner.website/wp-content/uploads/2025/09/Youth-Future-Summit-2026-in-Geneva-Switzerland-YFS-2026.webp"
  },
  {
    "title": "Your Europe, Your Say! (YEYS) 2026 | Fully Funded Youth Event in Brussel

In [13]:
agent_final_response

'```json\n[\n  {\n    "title": "Youth Future Summit 2026 in Geneva, Switzerland | YFS 2026",\n    "organization": "Scholarships Corner",\n    "description": "A global opportunity offering scholarships, fellowships, or internships. Refer to the application link for full details.",\n    "deadline": "November 20, 2025",\n    "application_link": "https://scholarshipscorner.website/youth-future-summit-switzerland/",\n    "eligibility_criteria": "Refer to application link for details.",\n    "category": [\n      "Youth Forum",\n      "Opportunity",\n      "Conference"\n    ],\n    "full_date_application": "2025-11-20",\n    "image_url": "https://scholarshipscorner.website/wp-content/uploads/2025/09/Youth-Future-Summit-2026-in-Geneva-Switzerland-YFS-2026.webp"\n  },\n  {\n    "title": "Your Europe, Your Say! (YEYS) 2026 | Fully Funded Youth Event in Brussels",\n    "organization": "Scholarships Corner",\n    "description": "A global opportunity offering scholarships, fellowships, or internshi

In [14]:
extracted_opportunities = await parse_and_display_opportunities(agent_final_response)
extracted_opportunities = await parse_and_display_opportunities(agent_final_response)

# If parsing failed, try to recover a JSON array from the raw model response
if not extracted_opportunities and agent_final_response:
    match = re.search(r'(\[.*\])', agent_final_response, re.S)
    if match:
        json_text = match.group(1)
        try:
            extracted_opportunities = json.loads(json_text)
            print(f"‚úÖ Recovered {len(extracted_opportunities)} opportunities by extracting JSON array.")
            pd.DataFrame(extracted_opportunities).to_csv("opportunities.csv", index=False)
        except json.JSONDecodeError:
            # Try a simple cleanup (remove stray newlines/carriage returns) and retry
            cleaned = json_text.replace('\n', '').replace('\r', '')
            try:
                extracted_opportunities = json.loads(cleaned)
                print(f"‚úÖ Recovered {len(extracted_opportunities)} opportunities after cleaning JSON.")
                pd.DataFrame(extracted_opportunities).to_csv("opportunities.csv", index=False)
            except Exception as e:
                print(f"‚ùå Failed to parse recovered JSON: {e}")
    else:
        print("‚ùå No JSON array found in the agent response to recover.")

print(f"‚úÖ Extracted {len(extracted_opportunities)} opportunities from the webpage.")


### Parsing opportunities from agent's response
Error parsing JSON from agent response: Expecting value: line 1 column 1 (char 0)
Raw response:
```json
[
  {
    "title": "Youth Future Summit 2026 in Geneva, Switzerland | YFS 2026",
    "organization": "Scholarships Corner",
    "description": "A global opportunity offering scholarships, fellowships, or internships. Refer to the application link for full details.",
    "deadline": "November 20, 2025",
    "application_link": "https://scholarshipscorner.website/youth-future-summit-switzerland/",
    "eligibility_criteria": "Refer to application link for details.",
    "category": [
      "Youth Forum",
      "Opportunity",
      "Conference"
    ],
    "full_date_application": "2025-11-20",
    "image_url": "https://scholarshipscorner.website/wp-content/uploads/2025/09/Youth-Future-Summit-2026-in-Geneva-Switzerland-YFS-2026.webp"
  },
  {
    "title": "Your Europe, Your Say! (YEYS) 2026 | Fully Funded Youth Event in Brussels",
    "org

### Upload to MongoDB (Optional)

In [17]:
if extracted_opportunities:
    print("\n--- Uploading Opportunities to MongoDB ---")
    upload_status = upload_to_mongodb(
        data=extracted_opportunities,
        connection_string=MONGODB_CONNECTION_STRING,
        db_name=MONGODB_DB_NAME,
        collection_name=MONGODB_COLLECTION_NAME
    )
    print(f"MongoDB Upload Status: {upload_status}")
else:
    print("No opportunities extracted to upload to MongoDB.")

No opportunities extracted to upload to MongoDB.
