## Understanding the Challenge Data

### The Mission (kinda cool actually)
We're helping young people in Brazil find green jobs. UNICEF partnership, climate action, meaningful careers - the whole deal.
And because it's 2025, we'll build AI agents that sift through job descriptions and training programs, match them to people's profiles, and do it efficiently and ethically!

**The Brazilian green jobs landscape we're working with:**
- **Major cities**: São Paulo (finance & tech hub), Rio de Janeiro (energy & environment), Brasília (policy & government), Salvador (renewable energy), Recife (innovation centers)
- **Key sectors**: Renewable energy (solar, wind, hydro), sustainable agriculture, environmental consulting, green construction, waste management
- **Companies leading the charge**: Petrobras (transitioning to renewables), Vale (sustainable mining), Suzano (sustainable forestry), plus hundreds of green startups

We have a `data` directory with:
- **`jobs/`** - 200 job postings
- **`trainings/`** - 497 training programs

### Quick math reality check
697 items × however many personas we need to match = a lot of potential API calls, which gets expensive fast if we're not careful.

This is where being smart about it pays off. Literally.
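
To make that concrete, here's a rough back-of-the-envelope sketch. The persona count (`n_personas=100`) is a made-up placeholder, not a number from the challenge, and `estimate_calls` is a hypothetical helper. It compares one call per persona-item pair against one possible alternative: extract structured info once per item and per persona, then match locally.

```python
# Rough call-count estimate, not a cost model.
# ASSUMPTION: n_personas is a placeholder; swap in the real number.
N_JOBS = 200
N_TRAININGS = 497


def estimate_calls(n_personas: int, n_items: int = N_JOBS + N_TRAININGS) -> dict[str, int]:
    """Compare two strategies by number of LLM calls."""
    return {
        # Naive: ask the model about every persona-item pair.
        "pairwise": n_personas * n_items,
        # Alternative: one extraction call per item and per persona,
        # then match the extracted fields locally with no further calls.
        "extract_once": n_personas + n_items,
    }


estimate = estimate_calls(n_personas=100)
print(estimate)
```

With 100 hypothetical personas, that's 69,700 pairwise calls versus 797 extraction calls. The exact numbers will differ, but the gap between multiplying and adding is the point.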

In [1]:
# Let's see what we're working with
from pathlib import Path

# Count files and get basic statistics
jobs_dir = Path('../data/jobs')
trainings_dir = Path('../data/trainings')

job_files = list(jobs_dir.glob('*.md')) if jobs_dir.exists() else []
training_files = list(trainings_dir.glob('*.md')) if trainings_dir.exists() else []

print("Dataset Overview:")
print(f"Jobs: {len(job_files)}")
print(f"Trainings: {len(training_files)}")
print(f"Total items: {len(job_files) + len(training_files)}")

Dataset Overview:
Jobs: 200
Trainings: 497
Total items: 697


### Let's look at a job posting

In [2]:
# Helper function to peek at files
from typing import Union

from IPython.display import Markdown, display

def display_markdown_file(path: Union[str, Path]) -> None:
    """Display a markdown file in Jupyter - nothing fancy"""
    # Accepts a str or a Path (job_files holds Path objects)
    p = Path(path)
    if not p.exists():
        print(f"File not found: {p}")
        return
    # errors='ignore' silently drops any bytes that aren't valid UTF-8
    content = p.read_text(encoding='utf-8', errors='ignore')
    display(Markdown(content))

In [5]:
# Display a sample job
display_markdown_file(job_files[0])

# Join Our Team as a Banking Operations Supervisor!

**Your Role:**
You'll oversee daily banking operations in Brasília, ensuring our processes run smoothly and meet regulatory standards. This position requires someone who can manage documentation workflows and maintain compliance across all operational areas.

**What We're Looking For:**
- **Tecnólogo degree** with at least **1 year of experience** in banking or financial operations
- **Intermediate skills** in managing document control systems and preparing regulatory reports
- **Basic understanding** of compliance oversight and regulatory requirements
- Someone who pays attention to detail and can work effectively with cross-functional teams
- **Fluent Portuguese** communication skills for daily operations and reporting

**Your Key Responsibilities:**
- Supervise document control processes to ensure accuracy and timeliness
- Prepare and review regulatory reports for submission to banking authorities
- Monitor compliance procedures and identify areas for improvement
- Coordinate with different departments to maintain operational efficiency

**Location:**
This position is based in **Brasília** and requires in-person presence for daily banking operations oversight.

**Ready to take the next step in your banking career?** This role offers the opportunity to develop your expertise in administrative management within the banking sector while working with regulatory frameworks that keep our operations compliant and efficient.

Apply now to join our banking operations team in Brasília.

### And a training program

In [4]:
# Display a sample training
display_markdown_file(training_files[0])

**Why take this course?**

This **Client Support Fundamentals** training will help you:

✅ Master essential client interaction techniques at a foundational level  
✅ Apply best practices for transparency and compliance  
✅ Strengthen your resume with a recognized credential

**Course Details:**
- **Duration:** 12 weeks
- **Format:** Online
- **Language:** Portuguese (Brazil)
- **Certification:** Yes

**Prerequisites:**
None

**Don't miss the chance to stand out—register today!**

### What you'll notice

Jobs and trainings tend to cover the same kinds of information:
- **Overview/Description**
- **Location** or delivery format (this matters for matching)
- **Prerequisites** (skills, experience levels)
- **Outcomes** (for trainings)

But here's the kicker: they're not consistently formatted. Some use different headers, different structures, or different language.
Our solution needs to handle this chaos gracefully.

That's why regex or simple parsing alone won't cut it - we need something smarter: GenAI!
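
To see how quickly a fixed rule breaks, here's a minimal sketch using shortened snippets of the two samples shown above. The `bold_labels` helper and its pattern are hypothetical, written just for this illustration. It collects bold `**Label:**` headers at the start of a line, the convention the job posting happens to use.

```python
import re

# Matches bold labels such as "**Location:**" at the start of a line.
LABEL_PATTERN = re.compile(r"^\*\*(.+?):\*\*", re.MULTILINE)


def bold_labels(markdown_text: str) -> list[str]:
    """Return the bold 'Label:' headers found at line starts."""
    return LABEL_PATTERN.findall(markdown_text)


# Shortened excerpts of the sample job and training shown earlier.
job_snippet = """**Your Role:**
You'll oversee daily banking operations in Brasília.

**Location:**
This position is based in **Brasília**.
"""

training_snippet = """**Course Details:**
- **Duration:** 12 weeks
- **Format:** Online

**Prerequisites:**
None
"""

job_labels = bold_labels(job_snippet)
training_labels = bold_labels(training_snippet)

print(job_labels)
print(training_labels)
print("Location" in training_labels)
```

The rule finds `Location` in the job, but the training never uses that header: the closest thing is `Format: Online`, tucked inside a bullet the pattern doesn't even see. Every new layout would need another special case.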