# Two-Stage AI Pipeline: From Unstructured Text to Personalized Email

In this notebook, we'll build a practical AI pipeline that automates a recruiter's workflow:
1.  **Extract Data**: Turn unstructured LinkedIn profile text into structured JSON.
2.  **Generate Content**: Use that structured data to write a personalized outreach email.

We will explore prompt engineering techniques, such as schema definitions, few-shot prompting, and persona constraints, to iterate from a minimal baseline toward a more robust pipeline.

## 1. Setup

First, let's set up our environment and API credentials.


In [15]:
import json
import os

from dotenv import load_dotenv
from google import genai

# Load environment variables from .env file
load_dotenv()

# Initialize the client, passing the key read from GOOGLE_API_KEY explicitly
client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.


## 2. Load LinkedIn Profile Data

We'll use the LinkedIn profile text from Workshop 1 as our test data.


In [16]:
# Read the LinkedIn profile data
with open('../apps/data/hba.txt', 'r', encoding='utf-8') as f:
    linkedin_profile = f.read()

print("Profile loaded! First 500 characters:")
print(linkedin_profile[:500])
print(f"\nTotal length: {len(linkedin_profile)} characters")


Profile loaded! First 500 characters:

hugo bowne-anderson

    About me
    Podcast
    education
    Writing
    Talks

data scientist - educator - writer - podcaster
portrait

I'm an independent data and AI scientist, consultant, writer, educator & podcaster. My interests include promoting data & AI literacy/fluency, helping to spread data skills through organizations and society and lowering the barrier to entry for data science, analysis, and machine learning. Previously, I was Head of Developer Relations at Outerbounds, a comp

Total length: 11579 characters


## Part 1: LinkedIn Profile → Structured JSON

### Goal
Extract clean, structured JSON from messy LinkedIn profile text.

### Target Schema
We want to extract:
- `name`: Full name
- `current_title`: Current job title
- `location`: Current location
- `years_experience`: Approximate total years of experience
- `top_skills`: List of 3-5 key skills
- `recent_roles`: List of 2-3 most recent positions (title, company, duration)
- `education`: Highest degree and institution
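
A schema like this is easier to enforce when it also exists in code. The following is a minimal sketch of a validator for the fields listed above. The names `EXPECTED_TYPES` and `validate_profile` are hypothetical, and allowing `None` for scalar fields is an assumption about how missing data should be represented.

```python
# Hypothetical validator for the target schema; all names here are illustrative.
EXPECTED_TYPES = {
    "name": str,
    "current_title": str,
    "location": str,
    "years_experience": int,
    "top_skills": list,
    "recent_roles": list,
    "education": str,
}
ROLE_KEYS = {"title", "company", "duration"}


def validate_profile(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in data:
            problems.append(f"missing field: {field}")
            continue
        value = data[field]
        # Assumption: None is acceptable for scalar fields the profile does not state.
        if value is None and expected is not list:
            continue
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    roles = data.get("recent_roles")
    if isinstance(roles, list):
        for i, role in enumerate(roles):
            if not isinstance(role, dict):
                problems.append(f"recent_roles[{i}]: expected an object")
                continue
            missing = ROLE_KEYS - role.keys()
            if missing:
                problems.append(f"recent_roles[{i}]: missing {sorted(missing)}")
    return problems
```

Running each variation's parsed output through a check like this gives a concrete way to compare prompts, rather than judging the JSON by eye.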

Let's test 5 different prompt engineering approaches and see which works best.


### Variation 1: Minimal Prompt (Baseline)

Start simple: ask for JSON extraction without specifying a schema or an output format.


In [17]:
prompt_v1 = f"""Extract the key information from this LinkedIn profile as JSON:

{linkedin_profile}
"""

response_v1 = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=prompt_v1,
)

print("=== Variation 1: Minimal Prompt ===")
print(response_v1.text)


=== Variation 1: Minimal Prompt ===
```json
{
  "name": "Hugo Bowne-Anderson",
  "professional_identities": [
    "Data Scientist",
    "Educator",
    "Writer",
    "Podcaster",
    "AI Scientist",
    "Consultant"
  ],
  "about": {
    "summary": "Hugo Bowne-Anderson is an independent data and AI scientist, consultant, writer, educator, and podcaster. He focuses on promoting data and AI literacy/fluency, spreading data skills across organizations and society, and lowering the barrier to entry for data science, analysis, and machine learning.",
    "mission_and_interests": [
      "Promoting data & AI literacy/fluency",
      "Spreading data skills through organizations and society",
      "Lowering the barrier to entry for data science, analysis, and machine learning",
      "Writing about all things data",
      "Speaking and teaching tutorials/workshops at conferences",
      "Exploring new ways to teach and discuss data science, ML, and AI"
    ]
  },
  "experience": [
    {
```

### Variation 5: With Schema Definition, JSON Mode, and Error Handling

This variation combines explicit field definitions, JSON mode (`response_mime_type` set to `application/json`), and instructions for handling missing data.
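
JSON mode matters because, without it, the reply may arrive wrapped in a Markdown fence, as Variation 1's output did, and `json.loads` would reject it. When JSON mode is not available, a tolerant parser is one fallback. This is a minimal sketch; `parse_json_reply` is a hypothetical helper, not part of the SDK.

```python
import json

# A Markdown code fence, built this way to avoid writing a literal fence here.
FENCE = "`" * 3


def parse_json_reply(text: str):
    """Parse a model reply as JSON, tolerating an optional Markdown fence around it."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line, which may carry a language tag such as "json".
        first_newline = cleaned.find("\n")
        cleaned = cleaned[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if present.
        stripped = cleaned.rstrip()
        if stripped.endswith(FENCE):
            cleaned = stripped[: -len(FENCE)]
    # Raises json.JSONDecodeError if the remainder is still not valid JSON.
    return json.loads(cleaned)
```

A reply that is already bare JSON passes through unchanged, so the helper can be applied to both fenced and unfenced responses.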

In [9]:
prompt_v5 = f"""Extract structured information from the LinkedIn profile below.

Required fields:
- name (string): Full name of the person
- current_title (string): Most recent job title
- location (string): Current location
- years_experience (integer): Total years of professional experience (estimate if needed)
- top_skills (array): 3-5 key skills or areas of expertise
- recent_roles (array): 2-3 most recent positions with title, company, and duration
- education (string): Highest degree and institution

Important rules:
- If a field is missing or unclear, use null for strings or [] for arrays
- For years_experience, calculate from experience section dates
- Extract skills from the profile text, skills section, or infer from roles
- Return ONLY valid JSON, no markdown formatting

<linkedin_profile>
{linkedin_profile}
</linkedin_profile>
"""

response_v5 = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=prompt_v5,
    config={
        "response_mime_type": "application/json"
    }
)

print("=== Variation 5")
print(response_v5.text)

# Try to parse and validate
try:
    parsed_v5 = json.loads(response_v5.text)
    print("\nValid JSON!")
    print(f"\nExtracted fields: {list(parsed_v5.keys())}")
except json.JSONDecodeError:
    print("\n Invalid JSON")


=== Variation 5
{
  "name": "Hugo Bowne-Anderson",
  "current_title": "Independent Data and AI Scientist, Consultant, Writer, Educator & Podcaster",
  "location": "New York, New York",
  "years_experience": 13,
  "top_skills": [
    "Data Science",
    "Machine Learning",
    "Python",
    "Education",
    "AI Literacy"
  ],
  "recent_roles": [
    {
      "title": "Independent Data and AI Scientist, Consultant, Writer, Educator & Podcaster",
      "company": "Self-employed",
      "duration": null
    },
    {
      "title": "Head of Developer Relations",
      "company": "Outerbounds",
      "duration": null
    },
    {
      "title": "Head of Marketing and Data Science Evangelism",
      "company": "Coiled",
      "duration": null
    }
  ],
  "education": null
}

Valid JSON!

Extracted fields: ['name', 'current_title', 'location', 'years_experience', 'top_skills', 'recent_roles', 'education']
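
The prompt above describes the schema in prose, so the field names and types are still only a request to the model. As a sketch, the same schema could also be passed to the API in structured form. This assumes the `google-genai` SDK accepts a schema dictionary under a `response_schema` config key alongside `response_mime_type`, with OpenAPI-style type names and a `nullable` flag; check the SDK documentation before relying on it. `ROLE_SCHEMA`, `PROFILE_SCHEMA`, and `extraction_config` are hypothetical names.

```python
# Assumption: schema dictionaries in this shape are accepted as "response_schema".
ROLE_SCHEMA = {
    "type": "OBJECT",
    "properties": {
        "title": {"type": "STRING"},
        "company": {"type": "STRING"},
        "duration": {"type": "STRING", "nullable": True},
    },
    "required": ["title", "company"],
}

PROFILE_SCHEMA = {
    "type": "OBJECT",
    "properties": {
        "name": {"type": "STRING", "nullable": True},
        "current_title": {"type": "STRING", "nullable": True},
        "location": {"type": "STRING", "nullable": True},
        "years_experience": {"type": "INTEGER", "nullable": True},
        "top_skills": {"type": "ARRAY", "items": {"type": "STRING"}},
        "recent_roles": {"type": "ARRAY", "items": ROLE_SCHEMA},
        "education": {"type": "STRING", "nullable": True},
    },
    "required": [
        "name", "current_title", "location", "years_experience",
        "top_skills", "recent_roles", "education",
    ],
}

extraction_config = {
    "response_mime_type": "application/json",
    "response_schema": PROFILE_SCHEMA,
}

# Usage (not executed here, requires the client from the setup cell):
# response = client.models.generate_content(
#     model="gemini-2.5-flash", contents=prompt_v5, config=extraction_config
# )
```

If the assumption holds, the prose field list in the prompt can focus on meaning (what counts as a "recent role") while the config carries the types.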


## Part 2: Structured Data → Personalized Recruiter Email

### Goal
Generate a personalized, professional recruiter outreach email from the structured candidate data.

### Job Position Context
Let's say we're recruiting for a "Senior AI/ML Engineer" position. We'll test different approaches to see what produces the best emails.


In [10]:
# First, let's extract the candidate data using our best prompt
def extract_profile_data(linkedin_text):
    """Extract structured JSON from LinkedIn profile text using our best prompt."""
    prompt = f"""Extract structured information from the LinkedIn profile below.

Required fields:
- name (string): Full name of the person
- current_title (string): Most recent job title
- location (string): Current location
- years_experience (integer): Total years of professional experience
- top_skills (array): 3-5 key skills or areas of expertise
- recent_roles (array): 2-3 most recent positions with title, company, and duration
- education (string): Highest degree and institution

Important rules:
- If a field is missing or unclear, use null for strings or [] for arrays
- For years_experience, calculate from experience section dates
- Extract skills from the profile text, skills section, or infer from roles
- Return ONLY valid JSON, no markdown formatting

<linkedin_profile>
{linkedin_text}
</linkedin_profile>
"""
    
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
        config={
            "response_mime_type": "application/json"
        }
    )
    
    return json.loads(response.text)

# Extract candidate data
candidate_data = extract_profile_data(linkedin_profile)
print("=== Extracted Candidate Data ===")
print(json.dumps(candidate_data, indent=2))


=== Extracted Candidate Data ===
{
  "name": "Hugo Bowne-Anderson",
  "current_title": "Independent Data and AI Scientist, Consultant, Writer, Educator & Podcaster",
  "location": null,
  "years_experience": null,
  "top_skills": [
    "Data Science",
    "Machine Learning",
    "Python",
    "Data Literacy",
    "Podcasting"
  ],
  "recent_roles": [
    {
      "title": "Independent Data and AI Scientist, Consultant, Writer, Educator & Podcaster",
      "company": "Self-Employed",
      "duration": null
    },
    {
      "title": "Head of Developer Relations",
      "company": "Outerbounds",
      "duration": null
    },
    {
      "title": "Head of Marketing and Data Science Evangelism",
      "company": "Coiled",
      "duration": null
    }
  ],
  "education": null
}
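
Note that this run returned `null` for `location`, `years_experience`, and `education`, even though the Variation 5 run above filled in the first two. Passing those nulls to the email stage may invite the model to comment on or guess at the missing details. One option is to drop empty fields before building the email prompt. The helper below is a hypothetical sketch, not part of the original notebook.

```python
def drop_missing(value):
    """Recursively remove None values and empty lists/dicts from parsed JSON."""
    empty = (None, [], {})
    if isinstance(value, dict):
        cleaned = {key: drop_missing(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if item not in empty}
    if isinstance(value, list):
        cleaned = [drop_missing(item) for item in value]
        return [item for item in cleaned if item not in empty]
    # Scalars (strings, numbers, booleans) are returned unchanged.
    return value
```

With this in place, `json.dumps(drop_missing(candidate_data), indent=2)` would show the email model only the fields the extractor actually found.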


In [11]:
# Job description for context
job_description = """Senior AI/ML Engineer
Location: Remote (US-based)
Salary: $180k-$250k + equity

We're looking for an experienced AI/ML engineer to join our team building next-generation 
machine learning infrastructure. You'll work on scaling ML systems, building developer tools, 
and helping data scientists deploy models to production.

Requirements:
- 5+ years experience in ML/AI or data science
- Strong Python skills
- Experience with ML frameworks and infrastructure
- Track record of building tools that others love to use
"""

print("Job we're recruiting for:")
print(job_description)


Job we're recruiting for:
Senior AI/ML Engineer
Location: Remote (US-based)
Salary: $180k-$250k + equity

We're looking for an experienced AI/ML engineer to join our team building next-generation 
machine learning infrastructure. You'll work on scaling ML systems, building developer tools, 
and helping data scientists deploy models to production.

Requirements:
- 5+ years experience in ML/AI or data science
- Strong Python skills
- Experience with ML frameworks and infrastructure
- Track record of building tools that others love to use



### Email Variation 1: Minimal (Baseline)

Simple request with no specific guidance.


In [None]:
email_prompt_v1 = f"""Write a recruiter email to this candidate:

{json.dumps(candidate_data, indent=2)}

Job:
{job_description}
"""

email_v1 = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=email_prompt_v1,
)

print("=== Email V1: Minimal ===")
print(email_v1.text)




### Email Variation 5: With Guardrails and Personalization Requirements

This variation includes specific requirements and explicit guardrails about what NOT to include.

In [13]:
email_prompt_v5 = f"""You are a professional technical recruiter. Write a personalized outreach email.

Candidate:
{json.dumps(candidate_data, indent=2)}

Job:
{job_description}

Guidelines:
- Keep it under 250 words
- Reference 2-3 specific things from their background
- Explain why this role is a good fit for THEM (not just why they're good for us)
- Professional and respectful tone
- Include a clear subject line

DO NOT:
- Use phrases like "I hope this email finds you well" or "exciting opportunity"
- Include salary information (mention "competitive compensation" only)
- Make it sound like a form letter
- Oversell or use excessive exclamation marks
- Include your signature (just end with a question)
"""

email_v5 = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=email_prompt_v5,
)

print("=== Email V5: Best Approach ===")
print(email_v5.text)


=== Email V5: Best Approach ===
Subject: ML Infrastructure & Developer Tools - A Potential Fit with Your Background

Hi Hugo,

Your work as an Independent Data and AI Scientist, especially your dedication to data literacy and evangelism, truly stands out. I also noted your impactful contributions as Head of Developer Relations at Outerbounds and Head of Marketing and Data Science Evangelism at Coiled.

My team is building next-generation machine learning infrastructure, focusing on scaling ML systems, creating robust developer tools, and streamlining model deployment for data scientists. We're seeking an experienced Senior AI/ML Engineer with strong Python skills and a track record of building tools that others love to use.

Given your unique blend of deep technical understanding and a clear talent for empowering data scientists – evident through your dev relations experience and podcasting – I believe you could play a pivotal role in shaping how practitioners interact with powerful ML
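
Guardrails written into a prompt are instructions, not guarantees, so it can help to check the generated email against them in code. The sketch below covers the rules from the prompt that are mechanically checkable. `check_email_guardrails` is a hypothetical helper; the exclamation-mark threshold and the dollar-sign test for salary are assumptions, since the prompt does not quantify "excessive" or define "salary information".

```python
import re

BANNED_PHRASES = ("i hope this email finds you well", "exciting opportunity")


def check_email_guardrails(email: str, max_words: int = 250,
                           max_exclamations: int = 1) -> list[str]:
    """Return the guardrails the email appears to violate (empty list if none)."""
    violations = []
    lowered = email.lower()
    # "Under 250 words": the count here includes the subject line.
    if len(email.split()) >= max_words:
        violations.append("too long")
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            violations.append(f"banned phrase: {phrase}")
    if not any(line.lower().startswith("subject:") for line in email.splitlines()):
        violations.append("missing subject line")
    # Assumption: a dollar sign followed by a digit signals salary information.
    if re.search(r"\$\s?\d", email):
        violations.append("mentions a dollar amount")
    if email.count("!") > max_exclamations:
        violations.append("too many exclamation marks")
    if not email.rstrip().endswith("?"):
        violations.append("does not end with a question")
    return violations
```

Softer requirements, such as "explain why this role is a good fit for THEM", cannot be verified this way and are a better fit for the LLM judge later in the notebook.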

## Complete Two-Stage Pipeline

Now let's combine both stages into a complete, reusable function.


In [14]:
def generate_recruiter_email_pipeline(linkedin_text, job_description):
    """Complete pipeline: LinkedIn text → Structured data → Personalized email"""
    
    # Stage 1: Extract structured data
    print("Stage 1: Extracting candidate data...")
    candidate_data = extract_profile_data(linkedin_text)
    print("Extracted candidate data")
    print(json.dumps(candidate_data, indent=2))
    
    # Stage 2: Generate email
    print("\nStage 2: Generating email...")
    email_prompt = f"""You are a professional technical recruiter. Write a personalized outreach email.

Candidate:
{json.dumps(candidate_data, indent=2)}

Job:
{job_description}

Guidelines:
- Keep it under 250 words
- Reference 2-3 specific things from their background
- Explain why this role is a good fit for THEM
- Professional and respectful tone
- Include a clear subject line

DO NOT:
- Use phrases like "I hope this email finds you well" or "exciting opportunity"
- Make it sound like a form letter
- Oversell or use excessive exclamation marks
- Include your signature
"""
    
    email_response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=email_prompt,
    )
    
    print("Generated email\n")
    return email_response.text

# Test the complete pipeline
print("=== COMPLETE PIPELINE TEST ===")
final_email = generate_recruiter_email_pipeline(linkedin_profile, job_description)
print("\n" + "="*60)
print("FINAL EMAIL:")
print("="*60)
print(final_email)


=== COMPLETE PIPELINE TEST ===
Stage 1: Extracting candidate data...
Extracted candidate data
{
  "name": "Hugo Bowne-Anderson",
  "current_title": "Independent Data and AI Scientist, Educator",
  "location": null,
  "years_experience": null,
  "top_skills": [
    "Data Science",
    "Machine Learning",
    "Education",
    "Technical Writing",
    "Podcasting"
  ],
  "recent_roles": [
    {
      "title": "Head of Developer Relations",
      "company": "Outerbounds",
      "duration": null
    },
    {
      "title": "Head of Marketing and Data Science Evangelism",
      "company": "Coiled",
      "duration": null
    },
    {
      "title": "Python Curriculum Engineer",
      "company": "DataCamp",
      "duration": null
    }
  ],
  "education": null
}

Stage 2: Generating email...
Generated email


FINAL EMAIL:
Subject: Senior AI/ML Engineer - ML Infrastructure & Developer Tools

Hi Hugo,

Your profile immediately caught my attention as I'm searching for a Senior AI/ML Engineer foc
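
The pipeline above calls `json.loads` on the stage 1 reply with no fallback, so one malformed response stops the whole run. A small retry wrapper is one way to harden it. This is a sketch under stated assumptions: `with_retries` is a hypothetical helper, it retries only on `json.JSONDecodeError` by default, and any API error types would need to be added to `retry_on` by the caller.

```python
import json
import time


def with_retries(fn, attempts=3, base_delay=1.0,
                 retry_on=(json.JSONDecodeError,), sleep=time.sleep):
    """Call fn() until it succeeds, doubling the wait after each failed attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

For example, stage 1 could be wrapped as `with_retries(lambda: extract_profile_data(linkedin_text))`. The `sleep` parameter exists so tests can replace the real delay with a stub.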

## LLM Judge

In [20]:
EVALUATOR_PROMPT = """
You are an expert evaluator assessing outputs from an outreach automation system
that drafts personalized emails based on a recipient’s LinkedIn profile.

Your Task:
Determine if the assistant-generated outreach email is appropriately personalized
and relevant to the recipient’s background and role, as described in their LinkedIn profile.

Evaluation Criterion:
Personalization and Relevance

Definition of Pass/Fail:
- Fail: The email is generic, irrelevant, or mismatched to the recipient’s background,
  interests, or role. It could have been sent to anyone.
- Pass: The email clearly demonstrates understanding of the recipient’s professional
  background, role, or achievements. It references specific details from the profile
  and establishes a relevant connection or value proposition.

Profile Dimensions to Consider:
- Current role and industry
- Skills, interests, or accomplishments
- Relevance of the email’s purpose to the recipient’s background

Output Format:
Return your evaluation as a JSON object with two keys:
1. reasoning: A brief explanation (1–2 sentences) for your decision.
2. answer: Either "Pass" or "Fail".

Examples:
---
Input 1:
LinkedIn Summary:
"Head of Data Science at FinEdge. I lead teams building predictive models for
credit scoring and fraud detection. Passionate about applying ML in finance."

Generated Email:
"Hey there! I’m reaching out to connect and share a few exciting updates from our AI team.
We’ve been working on some cool tools for startups and would love to get your feedback!"

Evaluation:
{
  "reasoning": "The email is generic and makes no reference to the recipient’s background in
  financial ML or leadership. It could have been sent to anyone.",
  "answer": "Fail"
}
---
Input 2:
LinkedIn Summary:
"Head of Data Science at FinEdge. I lead teams building predictive models for
credit scoring and fraud detection. Passionate about applying ML in finance."

Generated Email:
"Hi Alex, I saw your work leading the data science team at FinEdge on credit scoring and
fraud detection—really impressive. We’ve built a platform for managing ML model drift
in financial institutions, and I’d love to hear your thoughts."

Evaluation:
{
  "reasoning": "The email references the recipient’s leadership role and domain focus, and
  clearly connects the product to financial ML applications.",
  "answer": "Pass"
}
---
Now, evaluate the following:

LinkedIn Summary: {{LINKEDIN_SUMMARY_HERE}}
Generated Email: {{GENERATED_EMAIL_HERE}}

Your JSON Evaluation:
"""
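
The template uses `{{...}}` placeholders rather than an f-string or `str.format` because the few-shot examples contain literal JSON braces, which `str.format` would try to interpret as fields. Below is a minimal sketch of a helper that fills such placeholders in a single pass, so that text inside a substituted value is never rescanned. `fill_template` and `PLACEHOLDER` are hypothetical names.

```python
import re

# Matches {{UPPER_CASE_NAME}} placeholders; single JSON braces are left alone.
PLACEHOLDER = re.compile(r"\{\{([A-Z_]+)\}\}")


def fill_template(template: str, values: dict) -> str:
    """Replace each {{KEY}} in the template with values[KEY] in one pass."""

    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder {key}")
        return values[key]

    # A function replacement avoids backslash-escape handling in the values.
    return PLACEHOLDER.sub(substitute, template)
```

Called as `fill_template(EVALUATOR_PROMPT, {"LINKEDIN_SUMMARY_HERE": summary, "GENERATED_EMAIL_HERE": email})`, it fails loudly if a placeholder has no value instead of sending a half-filled prompt to the judge.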

In [21]:
# Example emails to evaluate
examples = [
    {
        "linkedin_summary": "Senior ML Engineer at TechCorp specializing in NLP and conversational AI. 5 years experience building production models.",
        "email": """Subject: Your NLP work at TechCorp

Hi Sarah,

I saw your work on conversational AI at TechCorp and was particularly impressed by your focus on production ML systems. We're building an evaluation platform specifically for NLP models in production, and your experience with both model development and deployment would be invaluable.

Would you be open to a brief conversation about the challenges you've faced with model evaluation?

Best,
Alex"""
    },
    {
        "linkedin_summary": "Senior ML Engineer at TechCorp specializing in NLP and conversational AI. 5 years experience building production models.",
        "email": """Subject: Exciting opportunity!

Hi there!

I hope this email finds you well! We have an amazing opportunity in AI/ML that would be perfect for talented engineers like yourself.

Our company is doing cutting-edge work and we'd love to chat. Let me know if you're interested!

Thanks,
Mike"""
    },
    {
        "linkedin_summary": "Senior ML Engineer at TechCorp specializing in NLP and conversational AI. 5 years experience building production models.",
        "email": """Subject: ML position

Hi,

We're hiring ML engineers. You have ML experience so thought I'd reach out. The role involves building models and working with data.

Interested?

Thanks,
Jen"""
    },
    {
        "linkedin_summary": "Senior ML Engineer at TechCorp specializing in NLP and conversational AI. 5 years experience building production models.",
        "email": """Subject: ML role

Hi Sarah,

We're hiring ML engineers and saw you have ML experience. The role involves building models and working with data.

Let me know if you're interested.

Thanks,
Tom"""
    }
]

In [23]:
from google.genai import types

for i, example in enumerate(examples, 1):
    print(f"\n{'='*60}")
    print(f"EMAIL {i}")
    print('='*60)
    print(example['email'])
    
    judge_prompt = EVALUATOR_PROMPT.replace("{{LINKEDIN_SUMMARY_HERE}}", example['linkedin_summary']).replace("{{GENERATED_EMAIL_HERE}}", example['email'])
    
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=judge_prompt,
        config=types.GenerateContentConfig(response_mime_type='application/json')
    )
    
    evaluation = json.loads(response.text)
    print(f"\n✓ Judgment: {evaluation['answer']}")
    print(f"Reasoning: {evaluation['reasoning']}")


EMAIL 1
Subject: Your NLP work at TechCorp

Hi Sarah,

I saw your work on conversational AI at TechCorp and was particularly impressed by your focus on production ML systems. We're building an evaluation platform specifically for NLP models in production, and your experience with both model development and deployment would be invaluable.

Would you be open to a brief conversation about the challenges you've faced with model evaluation?

Best,
Alex

✓ Judgment: Pass
Reasoning: The email specifically references the recipient's specialization in NLP and conversational AI at TechCorp, her experience with production ML systems, and proposes a relevant solution for NLP model evaluation.

EMAIL 2
Subject: Exciting opportunity!

Hi there!

I hope this email finds you well! We have an amazing opportunity in AI/ML that would be perfect for talented engineers like yourself.

Our company is doing cutting-edge work and we'd love to chat. Let me know if you're interested!

Thanks,
Mike

✓ Judgment:
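
The loop above trusts the judge to return exactly the keys and labels the prompt asks for. As a sketch, the verdict can be validated before use and the results aggregated into a pass rate. `parse_verdict` and `pass_rate` are hypothetical helpers, not part of the original notebook.

```python
import json
from collections import Counter

VALID_ANSWERS = {"Pass", "Fail"}


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON reply and check it has the expected shape."""
    evaluation = json.loads(raw)
    if not isinstance(evaluation, dict):
        raise ValueError("judge reply is not a JSON object")
    answer = evaluation.get("answer")
    if answer not in VALID_ANSWERS:
        raise ValueError(f"unexpected answer: {answer!r}")
    return {"answer": answer, "reasoning": str(evaluation.get("reasoning", ""))}


def pass_rate(verdicts: list[dict]) -> float:
    """Fraction of verdicts marked Pass; 0.0 for an empty list."""
    counts = Counter(verdict["answer"] for verdict in verdicts)
    total = sum(counts.values())
    return counts["Pass"] / total if total else 0.0
```

Collecting `parse_verdict(response.text)` for each example and reporting `pass_rate` turns the judge into a single number that can be tracked as the email prompt changes.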