# Two-Stage AI Pipeline: From Unstructured Text to Personalized Email

In this notebook, we'll build a practical AI pipeline that automates a recruiter's workflow:
1.  **Extract Data**: Turn unstructured LinkedIn profile text into structured JSON.
2.  **Generate Content**: Use that structured data to write a personalized outreach email.
3.  **Evaluate Quality**: Build an **LLM-as-a-Judge** to automatically grade each email's personalization as Pass or Fail.

We will explore prompt engineering techniques (like schema definitions and persona constraints) to iterate from a minimal baseline to a working prototype.

This will give us the building blocks for creating more reliable and testable AI applications.

## 1. Setup

First, let's set up our environment and API credentials.


In [1]:
import os
import json
from dotenv import load_dotenv
from google import genai

# Load environment variables from .env file
load_dotenv()

# Initialize the client, passing the GOOGLE_API_KEY loaded from the environment
client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

## 2. Load LinkedIn Profile Data

We'll use the LinkedIn profile text from Workshop 1 as our test data.


In [2]:
# Read the LinkedIn profile data
with open('../apps/data/hba.txt', 'r', encoding='utf-8') as f:
    linkedin_profile = f.read()

print("Profile loaded! First 500 characters:")
print(linkedin_profile[:500])
print(f"\nTotal length: {len(linkedin_profile)} characters")


Profile loaded! First 500 characters:
Contact
hugobowne@gmail.com
www.linkedin.com/in/hugo-bowne-
anderson-045939a5 (LinkedIn)
hugobowne.github.io/ (Personal)
Hugo Bowne-Anderson
Data and AI scientist, consultant. writer, educator, machine learner,
podcaster.
Darlinghurst, New South Wales, Australia
Top Skills
Artificial Intelligence (AI)
Data Science
Developer Relations
Languages
French (Elementary)
Summary
I'm an independent data and AI scientist, consultant, writer, educator
& podcaster. My interests include promoting data & AI l

Total length: 4900 characters


## Part 1: LinkedIn Profile → Structured JSON

### Goal
Extract clean, structured JSON from messy LinkedIn profile text.

### Target Schema
We want to extract:
- `name`: Full name
- `current_title`: Current job title
- `location`: Current location
- `years_experience`: Approximate total years of experience
- `top_skills`: List of 3-5 key skills
- `recent_roles`: List of 2-3 most recent positions (title, company, duration)
- `education`: Highest degree and institution
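
To make the target concrete, the schema above can be written down as Python types. This is only a sketch: the `Role` and `CandidateProfile` names are illustrative and are not used elsewhere in the notebook, and the sample values are placeholders rather than real extraction output.

```python
from typing import Optional, TypedDict


class Role(TypedDict):
    title: str
    company: str
    duration: str


class CandidateProfile(TypedDict):
    # Scalar fields may be None when the profile does not state them.
    name: Optional[str]
    current_title: Optional[str]
    location: Optional[str]
    years_experience: Optional[int]
    top_skills: list[str]
    recent_roles: list[Role]
    education: Optional[str]


# Placeholder record showing the shape we want back from the model.
example_profile: CandidateProfile = {
    "name": "Jane Doe",
    "current_title": "Senior ML Engineer",
    "location": "Sydney, Australia",
    "years_experience": 8,
    "top_skills": ["Python", "NLP", "MLOps"],
    "recent_roles": [
        {"title": "Senior ML Engineer", "company": "ExampleCo", "duration": "2021 - Present"},
    ],
    "education": "MSc Computer Science, Example University",
}

print(sorted(example_profile))
```

Type hints like these are not enforced at runtime; they mainly document the shape that later stages rely on.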

Let's test two prompt engineering approaches and see which works better.


### Variation 1: Minimal Prompt (Baseline)

Start simple: ask for JSON extraction without specifying any structure.


In [3]:
prompt_v1 = f"""Extract the key information from this LinkedIn profile as JSON:

{linkedin_profile}
"""

response_v1 = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=prompt_v1,
)

print("=== Variation 1: Minimal Prompt ===")
print(response_v1.text)


=== Variation 1: Minimal Prompt ===
```json
{
  "name": "Hugo Bowne-Anderson",
  "headline": "Data and AI scientist, consultant, writer, educator, machine learner, podcaster.",
  "location": "Darlinghurst, New South Wales, Australia",
  "contact_info": {
    "email": "hugobowne@gmail.com",
    "linkedin_url": "www.linkedin.com/in/hugo-bowne-anderson-045939a5",
    "personal_website_url": "hugobowne.github.io/"
  },
  "summary": "I'm an independent data and AI scientist, consultant, writer, educator & podcaster. My interests include promoting data & AI literacy/fluency, helping to spread data skills through organizations and society and lowering the barrier to entry for data science, analysis, and machine learning. Previously, I was Head of Developer Relations at Outerbounds, a company committed to building infrastructure that provides a solid foundation for machine learning applications of all shapes and sizes. I am also the host of the industry podcast Vanishing Gradients. I was previ

### Variation 2: With Schema Definition, JSON Mode, and Error Handling

This variation combines explicit field definitions, JSON mode, and instructions for handling missing data.
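
JSON mode matters because the baseline response above came back wrapped in a Markdown code fence, which `json.loads` rejects. Where JSON mode is not available, one fallback is to strip the fence before parsing. A minimal sketch, assuming the fence is the only wrapper around the payload; `parse_json_lenient` is a hypothetical helper, not part of the notebook:

```python
import json
import re

# Matches an optional Markdown code fence (three backticks, optional "json" tag)
# wrapped around the payload. `{3} is used to avoid writing the fence literally.
FENCE_RE = re.compile(r"^\s*`{3}(?:json)?\s*\n(.*?)\n\s*`{3}\s*$", re.DOTALL)


def parse_json_lenient(text: str):
    """Parse JSON, first removing a surrounding Markdown code fence if present."""
    match = FENCE_RE.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


fence = "`" * 3
fenced = f'{fence}json\n{{"name": "Jane Doe"}}\n{fence}'
plain = '{"name": "Jane Doe"}'

print(parse_json_lenient(fenced))
print(parse_json_lenient(plain))
```

This only handles the fence case; any other surrounding prose would still raise `json.JSONDecodeError`, which is arguably the right behavior for a pipeline stage.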

In [4]:
prompt_v2 = f"""Extract structured information from the LinkedIn profile below.

Required fields:
- name (string): Full name of the person
- current_title (string): Most recent job title
- location (string): Current location
- years_experience (integer): Total years of professional experience (estimate if needed)
- top_skills (array): 3-5 key skills or areas of expertise
- recent_roles (array): 2-3 most recent positions with title, company, and duration
- education (string): Highest degree and institution

Important rules:
- If a field is missing or unclear, use null for strings or [] for arrays
- For years_experience, calculate from experience section dates
- Extract skills from the profile text, skills section, or infer from roles
- Return ONLY valid JSON, no markdown formatting

<linkedin_profile>
{linkedin_profile}
</linkedin_profile>
"""

response_v2 = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=prompt_v2,
    config={
        "response_mime_type": "application/json"
    }
)

print("=== Variation 2")
print(response_v2.text)

# Try to parse and validate
try:
    parsed_v2 = json.loads(response_v2.text)
    print("\nValid JSON!")
    print(f"\nExtracted fields: {list(parsed_v2.keys())}")
except json.JSONDecodeError:
    print("\nInvalid JSON")


=== Variation 2
{
  "name": "Hugo Bowne-Anderson",
  "current_title": "Independent Data and AI Scientist",
  "location": "Darlinghurst, New South Wales, Australia",
  "years_experience": 15,
  "top_skills": [
    "Artificial Intelligence (AI)",
    "Data Science",
    "Developer Relations"
  ],
  "recent_roles": [
    {
      "title": "Independent Data and AI Scientist",
      "company": "Freelance",
      "duration": "July 2024 - Present (1 year 6 months)"
    },
    {
      "title": "Head of Developer Relations",
      "company": "Outerbounds",
      "duration": "February 2022 - August 2024 (2 years 7 months)"
    },
    {
      "title": "Head of Data Science Evangelism and Marketing",
      "company": "Coiled",
      "duration": "May 2020 - October 2021 (1 year 6 months)"
    }
  ],
  "education": "Doctor of Philosophy (PhD) from UNSW Australia"
}

Valid JSON!

Extracted fields: ['name', 'current_title', 'location', 'years_experience', 'top_skills', 'recent_roles', 'education']


## Part 2: Structured Data → Personalized Recruiter Email

### Goal
Generate a personalized, professional recruiter outreach email from the structured candidate data.

### Job Position Context
Let's say we're recruiting for a "Senior AI/ML Engineer" position. We'll test different approaches to see what produces the best emails.


In [5]:
# First, let's extract the candidate data using our best prompt
def extract_profile_data(linkedin_text):
    """Extract structured JSON from LinkedIn profile text using our best prompt."""
    prompt = f"""Extract structured information from the LinkedIn profile below.

Required fields:
- name (string): Full name of the person
- current_title (string): Most recent job title
- location (string): Current location
- years_experience (integer): Total years of professional experience
- top_skills (array): 3-5 key skills or areas of expertise
- recent_roles (array): 2-3 most recent positions with title, company, and duration
- education (string): Highest degree and institution

Important rules:
- If a field is missing or unclear, use null for strings or [] for arrays
- For years_experience, calculate from experience section dates
- Extract skills from the profile text, skills section, or infer from roles
- Return ONLY valid JSON, no markdown formatting

<linkedin_profile>
{linkedin_text}
</linkedin_profile>
"""
    
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
        config={
            "response_mime_type": "application/json"
        }
    )
    
    return json.loads(response.text)

# Extract candidate data
candidate_data = extract_profile_data(linkedin_profile)
print("=== Extracted Candidate Data ===")
print(json.dumps(candidate_data, indent=2))


=== Extracted Candidate Data ===
{
  "name": "Hugo Bowne-Anderson",
  "current_title": "Independent Data and AI Scientist",
  "location": "Darlinghurst, New South Wales, Australia",
  "years_experience": 15,
  "top_skills": [
    "Artificial Intelligence (AI)",
    "Data Science",
    "Developer Relations"
  ],
  "recent_roles": [
    {
      "title": "Independent Data and AI Scientist",
      "company": "Freelance",
      "duration": "July 2024 - Present"
    },
    {
      "title": "Head of Developer Relations",
      "company": "Outerbounds",
      "duration": "February 2022 - August 2024"
    },
    {
      "title": "Head of Data Science Evangelism and Marketing",
      "company": "Coiled",
      "duration": "May 2020 - October 2021"
    }
  ],
  "education": "Doctor of Philosophy (PhD), UNSW Australia"
}


In [6]:
# Job description for context
job_description = """Senior AI/ML Engineer
Location: Remote (US-based)
Salary: $180k-$250k + equity

We're looking for an experienced AI/ML engineer to join our team building next-generation 
machine learning infrastructure. You'll work on scaling ML systems, building developer tools, 
and helping data scientists deploy models to production.

Requirements:
- 5+ years experience in ML/AI or data science
- Strong Python skills
- Experience with ML frameworks and infrastructure
- Track record of building tools that others love to use
"""

print("Job we're recruiting for:")
print(job_description)


Job we're recruiting for:
Senior AI/ML Engineer
Location: Remote (US-based)
Salary: $180k-$250k + equity

We're looking for an experienced AI/ML engineer to join our team building next-generation 
machine learning infrastructure. You'll work on scaling ML systems, building developer tools, 
and helping data scientists deploy models to production.

Requirements:
- 5+ years experience in ML/AI or data science
- Strong Python skills
- Experience with ML frameworks and infrastructure
- Track record of building tools that others love to use



### Email Variation 1: Minimal (Baseline)

Simple request with no specific guidance.


In [7]:
email_prompt_v1 = f"""Write a recruiter email to this candidate:

{json.dumps(candidate_data, indent=2)}

Job:
{job_description}
"""

email_v1 = client.models.generate_content(
    model="gemini-2.5-flash",  # same model as Email V2, so only the prompt differs
    contents=email_prompt_v1,
)

print("=== Email V1: Minimal ===")
print(email_v1.text)


=== Email V1: Minimal ===
Subject: Senior AI/ML Engineer Opportunity at [Your Company] - Building the Future of ML Infrastructure

Dear Hugo,

My name is [Your Name], and I'm a recruiter at [Your Company]. I came across your profile and was incredibly impressed with your extensive experience in AI and Data Science, particularly your expertise in Developer Relations and building tools for data scientists.

We're currently seeking a Senior AI/ML Engineer to join our growing team, and I believe your background aligns strongly with what we're looking for.

In this role, you'd be contributing to building next-generation machine learning infrastructure. This involves scaling ML systems, building developer tools, and enabling data scientists to deploy models into production efficiently. Your experience at Outerbounds and Coiled, focusing on developer relations and evangelism, combined with your deep knowledge in AI and data science, seems particularly relevant.

Specifically, we're looking fo

### Email Variation 2: With Guardrails and Personalization Requirements

This variation includes specific requirements and explicit guardrails about what NOT to include.

In [8]:
email_prompt_v2 = f"""You are a professional technical recruiter. Write a personalized outreach email.

Candidate:
{json.dumps(candidate_data, indent=2)}

Job:
{job_description}

Guidelines:
- Keep it under 250 words
- Reference 2-3 specific things from their background
- Explain why this role is a good fit for THEM (not just why they're good for us)
- Professional and respectful tone
- Include a clear subject line

DO NOT:
- Use phrases like "I hope this email finds you well" or "exciting opportunity"
- Include salary information (mention "competitive compensation" only)
- Make it sound like a form letter
- Oversell or use excessive exclamation marks
- Include your signature (just end with a question)
"""

email_v2 = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=email_prompt_v2,
)

print("=== Email V2 ===")
print(email_v2.text)


=== Email V2 ===
**Subject: Hugo – Scaling ML Systems & Developer Experience**

Hi Hugo,

Your extensive background as an Independent Data and AI Scientist, coupled with your impactful leadership in Developer Relations at Outerbounds and Data Science Evangelism at Coiled, immediately stood out.

Our team is building next-generation machine learning infrastructure, focusing on scaling ML systems and empowering data scientists to deploy models to production. Given your deep expertise in AI/ML and your proven track record in building and advocating for tools that empower developers, I believe you’d find a unique opportunity here. We're particularly seeking someone who can not only engineer robust systems but also champion the developer experience – an area where your DevRel leadership would be invaluable.

This is a Senior AI/ML Engineer role within a remote team, offering significant technical challenges and competitive compensation, where you could directly shape how data scientists int
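
Several of the guardrails in the V2 prompt are mechanical enough to verify in code before any human or model reviews the email. A minimal sketch follows. `check_email_guardrails` is a hypothetical helper, the threshold of one exclamation mark is an assumed reading of "excessive", and the salary pattern only catches dollar amounts.

```python
import re

BANNED_PHRASES = ("i hope this email finds you well", "exciting opportunity")
# Dollar amounts such as "$180k" or "$250,000" (assumption: salaries are in USD).
SALARY_RE = re.compile(r"\$\s?\d[\d,]*\s?k?", re.IGNORECASE)


def check_email_guardrails(email: str, max_words: int = 250,
                           max_exclamations: int = 1) -> list[str]:
    """Return a list of guardrail violations; an empty list means all checks passed."""
    problems = []
    lowered = email.lower()

    # "Keep it under 250 words": whitespace-separated tokens approximate words.
    if len(email.split()) >= max_words:
        problems.append(f"not under {max_words} words")
    if SALARY_RE.search(email):
        problems.append("mentions a salary figure")
    if email.count("!") > max_exclamations:
        problems.append("too many exclamation marks")
    if "subject:" not in lowered:
        problems.append("missing subject line")
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            problems.append(f"banned phrase: {phrase!r}")
    return problems


draft = (
    "Subject: Exciting opportunity!\n\n"
    "Hi there! I hope this email finds you well! The salary is $180k."
)
for problem in check_email_guardrails(draft):
    print("-", problem)
```

Checks like these cover the rules that can be counted or pattern-matched. Softer rules such as "make it sound personal" still need judgment.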

## Complete Two-Stage Pipeline

Now let's combine both stages into a complete, reusable function.


In [9]:
def generate_recruiter_email_pipeline(linkedin_text, job_description):
    """Complete pipeline: LinkedIn text → Structured data → Personalized email"""
    
    # Stage 1: Extract structured data
    print("Stage 1: Extracting candidate data...")
    candidate_data = extract_profile_data(linkedin_text)
    print("Extracted candidate data")
    print(json.dumps(candidate_data, indent=2))
    
    # Stage 2: Generate email
    print("\nStage 2: Generating email...")
    email_prompt = f"""You are a professional technical recruiter. Write a personalized outreach email.

Candidate:
{json.dumps(candidate_data, indent=2)}

Job:
{job_description}

Guidelines:
- Keep it under 250 words
- Reference 2-3 specific things from their background
- Explain why this role is a good fit for THEM
- Professional and respectful tone
- Include a clear subject line

DO NOT:
- Use phrases like "I hope this email finds you well" or "exciting opportunity"
- Include salary information (mention "competitive compensation" only)
- Make it sound like a form letter
- Oversell or use excessive exclamation marks
- Include your signature (just end with a question)
"""
    
    email_response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=email_prompt,
    )
    
    print("Generated email\n")
    return email_response.text

# Test the complete pipeline
print("=== COMPLETE PIPELINE TEST ===")
final_email = generate_recruiter_email_pipeline(linkedin_profile, job_description)
print("\n" + "="*60)
print("FINAL EMAIL:")
print("="*60)
print(final_email)


=== COMPLETE PIPELINE TEST ===
Stage 1: Extracting candidate data...
Extracted candidate data
{
  "name": "Hugo Bowne-Anderson",
  "current_title": "Independent Data and AI Scientist",
  "location": "Darlinghurst, New South Wales, Australia",
  "years_experience": 14,
  "top_skills": [
    "Artificial Intelligence (AI)",
    "Data Science",
    "Developer Relations"
  ],
  "recent_roles": [
    {
      "title": "Independent Data and AI Scientist",
      "company": "Freelance",
      "duration": "July 2024 - Present"
    },
    {
      "title": "Head of Developer Relations",
      "company": "Outerbounds",
      "duration": "February 2022 - August 2024"
    },
    {
      "title": "Head of Data Science Evangelism and Marketing",
      "company": "Coiled",
      "duration": "May 2020 - October 2021"
    }
  ],
  "education": "Doctor of Philosophy (PhD), UNSW Australia"
}

Stage 2: Generating email...
Generated email


FINAL EMAIL:
Subject: Hugo - Senior AI/ML Engineer: ML Infrastructure 
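
One way to make a pipeline like this testable without API calls is to pass the model call in as a function. The sketch below is an assumption-laden variant, not the notebook's implementation: `run_pipeline` and `fake_generate` are hypothetical names, and the prompts are shortened placeholders for the full prompts above.

```python
import json
from typing import Callable


def run_pipeline(linkedin_text: str, job_description: str,
                 generate: Callable[[str], str]) -> str:
    """Two-stage pipeline with the model call injected as generate(prompt) -> text."""
    # Stage 1: ask for JSON and parse it; json.loads raises if the reply is not valid JSON.
    raw = generate(f"Extract the candidate profile as JSON:\n{linkedin_text}")
    candidate = json.loads(raw)

    # Stage 2: the email prompt is built from the structured data, not the raw text.
    email_prompt = (
        "Write a personalized outreach email.\n"
        f"Candidate:\n{json.dumps(candidate, indent=2)}\n"
        f"Job:\n{job_description}"
    )
    return generate(email_prompt)


def fake_generate(prompt: str) -> str:
    """Stand-in for the real model call so the wiring can be exercised offline."""
    if prompt.startswith("Extract"):
        return '{"name": "Jane Doe", "top_skills": ["NLP"]}'
    return "Subject: Hello\n\nHi Jane, would you be open to a short call?"


print(run_pipeline("raw profile text", "Senior AI/ML Engineer", fake_generate))
```

With the real client, `generate` could be a small wrapper around `client.models.generate_content` that returns `response.text`.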

## LLM Judge

### Using an LLM to Evaluate Quality

How do we check if our pipeline is working?

*   **For Structured Data:** We can use **code-based checks** (e.g., `json.loads()` to verify the JSON is valid, or asserting that "years_experience" is an integer).
*   **For Content Quality:** We can't use code to check if an email is "polite" or "relevant." For this, we need **Judgment**.
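
The code-based checks for structured data can be sketched as follows. `validate_profile_json` is a hypothetical helper. It accepts `null` for scalar fields, which the extraction prompt allows for strings; extending that to `years_experience` is an assumption.

```python
import json

REQUIRED_FIELDS = {
    "name": str,
    "current_title": str,
    "location": str,
    "years_experience": int,
    "top_skills": list,
    "recent_roles": list,
    "education": str,
}


def validate_profile_json(raw: str) -> list[str]:
    """Return a list of problems found in the extractor's raw output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
            continue
        value = data[field]
        if value is None and expected is not list:
            continue  # null is accepted for scalar fields
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems


# Placeholder record, not real extraction output.
good = json.dumps({
    "name": "Jane Doe", "current_title": "ML Engineer", "location": None,
    "years_experience": 8, "top_skills": ["Python"], "recent_roles": [],
    "education": "MSc, Example University",
})
print(validate_profile_json(good))
print(validate_profile_json("not json"))
```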

**The Solution:** Use an LLM as a "Judge."

In this section, we will demonstrate this pattern by evaluating a set of example emails. We'll define a rubric and ask the LLM to grade each email as Pass or Fail, with a short justification.

**Exercise for the Reader:** After seeing how this works, try applying this judge to the actual output of the `generate_recruiter_email_pipeline` function we built above!

In [10]:
EVALUATOR_PROMPT = """
You are an expert evaluator assessing outputs from an outreach automation system
that drafts personalized emails based on a recipient’s LinkedIn profile.

Your Task:
Determine if the assistant-generated outreach email is appropriately personalized
and relevant to the recipient’s background and role, as described in their LinkedIn profile.

Evaluation Criterion:
Personalization and Relevance

Definition of Pass/Fail:
- Fail: The email is generic, irrelevant, or mismatched to the recipient’s background,
  interests, or role. It could have been sent to anyone.
- Pass: The email clearly demonstrates understanding of the recipient’s professional
  background, role, or achievements. It references specific details from the profile
  and establishes a relevant connection or value proposition.

Profile Dimensions to Consider:
- Current role and industry
- Skills, interests, or accomplishments
- Relevance of the email’s purpose to the recipient’s background

Output Format:
Return your evaluation as a JSON object with two keys:
1. reasoning: A brief explanation (1–2 sentences) for your decision.
2. answer: Either "Pass" or "Fail".

Examples:
---
Input 1:
LinkedIn Summary:
"Head of Data Science at FinEdge. I lead teams building predictive models for
credit scoring and fraud detection. Passionate about applying ML in finance."

Generated Email:
"Hey there! I’m reaching out to connect and share a few exciting updates from our AI team.
We’ve been working on some cool tools for startups and would love to get your feedback!"

Evaluation:
{
  "reasoning": "The email is generic and makes no reference to the recipient’s background in
  financial ML or leadership. It could have been sent to anyone.",
  "answer": "Fail"
}
---
Input 2:
LinkedIn Summary:
"Head of Data Science at FinEdge. I lead teams building predictive models for
credit scoring and fraud detection. Passionate about applying ML in finance."

Generated Email:
"Hi Alex, I saw your work leading the data science team at FinEdge on credit scoring and
fraud detection—really impressive. We’ve built a platform for managing ML model drift
in financial institutions, and I’d love to hear your thoughts."

Evaluation:
{
  "reasoning": "The email references the recipient’s leadership role and domain focus, and
  clearly connects the product to financial ML applications.",
  "answer": "Pass"
}
---
Now, evaluate the following:

LinkedIn Summary: {{LINKEDIN_SUMMARY_HERE}}
Generated Email: {{GENERATED_EMAIL_HERE}}

Your JSON Evaluation:
"""

In [11]:
# Example emails to evaluate
examples = [
    {
        "linkedin_summary": "Senior ML Engineer at TechCorp specializing in NLP and conversational AI. 5 years experience building production models.",
        "email": """Subject: Your NLP work at TechCorp

Hi Sarah,

I saw your work on conversational AI at TechCorp and was particularly impressed by your focus on production ML systems. We're building an evaluation platform specifically for NLP models in production, and your experience with both model development and deployment would be invaluable.

Would you be open to a brief conversation about the challenges you've faced with model evaluation?

Best,
Alex"""
    },
    {
        "linkedin_summary": "Senior ML Engineer at TechCorp specializing in NLP and conversational AI. 5 years experience building production models.",
        "email": """Subject: Exciting opportunity!

Hi there!

I hope this email finds you well! We have an amazing opportunity in AI/ML that would be perfect for talented engineers like yourself.

Our company is doing cutting-edge work and we'd love to chat. Let me know if you're interested!

Thanks,
Mike"""
    },
    {
        "linkedin_summary": "Senior ML Engineer at TechCorp specializing in NLP and conversational AI. 5 years experience building production models.",
        "email": """Subject: ML position

Hi,

We're hiring ML engineers. You have ML experience so thought I'd reach out. The role involves building models and working with data.

Interested?

Thanks,
Jen"""
    },
    {
        "linkedin_summary": "Senior ML Engineer at TechCorp specializing in NLP and conversational AI. 5 years experience building production models.",
        "email": """Subject: ML role

Hi Sarah,

We're hiring ML engineers and saw you have ML experience. The role involves building models and working with data.

Let me know if you're interested.

Thanks,
Tom"""
    }
]

In [12]:
from google.genai import types

for i, example in enumerate(examples, 1):
    print(f"\n{'='*60}")
    print(f"EMAIL {i}")
    print('='*60)
    print(example['email'])
    
    judge_prompt = EVALUATOR_PROMPT.replace("{{LINKEDIN_SUMMARY_HERE}}", example['linkedin_summary']).replace("{{GENERATED_EMAIL_HERE}}", example['email'])
    
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=judge_prompt,
        config=types.GenerateContentConfig(response_mime_type='application/json')
    )
    
    evaluation = json.loads(response.text)
    print(f"\n✓ Judgment: {evaluation['answer']}")
    print(f"Reasoning: {evaluation['reasoning']}")


EMAIL 1
Subject: Your NLP work at TechCorp

Hi Sarah,

I saw your work on conversational AI at TechCorp and was particularly impressed by your focus on production ML systems. We're building an evaluation platform specifically for NLP models in production, and your experience with both model development and deployment would be invaluable.

Would you be open to a brief conversation about the challenges you've faced with model evaluation?

Best,
Alex

✓ Judgment: Pass
Reasoning: The email directly references the recipient's specialization in NLP and conversational AI, their company, and their experience with production ML systems, making a clear and relevant connection.

EMAIL 2
Subject: Exciting opportunity!

Hi there!

I hope this email finds you well! We have an amazing opportunity in AI/ML that would be perfect for talented engineers like yourself.

Our company is doing cutting-edge work and we'd love to chat. Let me know if you're interested!

Thanks,
Mike

✓ Judgment: Fail
Reasonin
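
Because the judge's reply feeds into code, it is worth checking its shape before trusting it. The sketch below validates the two keys the rubric asks for and computes a simple pass rate. `parse_verdict` and `pass_rate` are hypothetical helpers, and the replies are hand-written stand-ins rather than real judge output.

```python
import json


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON reply and check it has the shape the rubric asks for."""
    verdict = json.loads(raw)
    if not isinstance(verdict, dict):
        raise ValueError("verdict must be a JSON object")
    if verdict.get("answer") not in ("Pass", "Fail"):
        raise ValueError(f"unexpected answer: {verdict.get('answer')!r}")
    if not isinstance(verdict.get("reasoning"), str):
        raise ValueError("reasoning must be a string")
    return verdict


def pass_rate(verdicts: list[dict]) -> float:
    """Fraction of verdicts marked Pass (0.0 for an empty list)."""
    if not verdicts:
        return 0.0
    return sum(v["answer"] == "Pass" for v in verdicts) / len(verdicts)


# Hand-written replies standing in for real judge output.
raw_replies = [
    '{"reasoning": "References specific NLP work.", "answer": "Pass"}',
    '{"reasoning": "Generic; could be sent to anyone.", "answer": "Fail"}',
]
verdicts = [parse_verdict(reply) for reply in raw_replies]
print(f"Pass rate: {pass_rate(verdicts):.0%}")
```

A pass rate over a fixed set of examples gives a single number to track as the prompts change, though the judge itself should be spot-checked against human labels.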

## Summary

In this notebook, we moved beyond simple queries and built a **Two-Stage Pipeline**, plus a way to evaluate its output:

1.  **Structured Extraction:** We tamed unstructured text (LinkedIn profiles) into clean, usable JSON data.
2.  **Grounded Generation:** We used that structured data to generate highly personalized content (emails), using specific constraints to control the tone.
3.  **Automated Evaluation:** We explored the **LLM-as-a-Judge** pattern to automatically assess quality, allowing us to judge "fuzzy" metrics like relevance and tone that standard code tests miss.

This gives us the building blocks for creating more reliable and testable AI applications.