# LinkedIn Profile Extraction and Validation Workflow

## Introduction: Testing Outputs of Nondeterministic Software

In this Lightning Lesson, we will focus on testing the outputs of nondeterministic software, specifically Large Language Models (LLMs). While we'll use a recruitment example—extracting structured data from LinkedIn profiles and automating email outreach—the main goal is to understand how to systematically evaluate outputs from software that does not always behave predictably.

This process involves multiple layers of testing, which are crucial in ensuring that the final application meets business goals and delivers reliable results. The layers include:

1. **Validating Structured Output**: The first step is ensuring that the LLM returns the correct structured data. In this case, we extract information like the candidate's name, current role, location, previous roles, and education from their LinkedIn profile. The accuracy of this data is essential for the next steps in the process.

2. **Ensuring Email Quality**: After extracting structured data, the next step is generating automated outreach emails. These emails must read well, and verifying their content requires domain expertise: for example, checking that the tone is professional and that the content is relevant to the business context.

3. **Achieving Business Goals**: Ultimately, the goal is to ensure that the outreach leads to replies and successfully recruits quality candidates. This means verifying that the entire flow—from profile extraction to email generation—aligns with business goals and results in effective engagement.

This lesson will focus specifically on the first layer: validating the result of the LLM call. We'll explore how to test and refine LLM outputs before moving on to more complex evaluations in later stages of development.

This notebook walks through a workflow for extracting structured data from LinkedIn profiles using an LLM, validating the outputs, and preparing the data for domain expert review. The key steps include:

1. Extracting structured JSON data from unstructured LinkedIn text.
2. Validating the JSON output.
3. Flagging suspicious profiles based on missing or invalid fields.
4. Saving the data to a CSV file.
5. Reviewing and annotating the profiles in an interactive table.

This process can be applied to LinkedIn profiles for individuals or companies to assist with data validation, annotation, and review.

## Step 1: Testing the Pipeline with One Profile

Before applying the process to all profiles, we start by testing the pipeline with a single LinkedIn profile. This ensures that:
- The LLM generates structured JSON data.
- The JSON output is valid and complete.

We'll use one text file (`<profile_name>.txt`) as input and observe the output. 

In [1]:
# Read the file content into a string
file_path = "data/hbaLI.txt"  # Replace with your actual file path

# Open the file in read mode and read the content
with open(file_path, "r", encoding="utf-8") as file:
    linkedin_text = file.read()

# Print the content to verify
# print(linkedin_text)

In [2]:
import os
from openai import OpenAI


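# Placeholder only: prefer setting OPENAI_API_KEY in your environment
# rather than hard-coding a real key in the notebook
os.environ["OPENAI_API_KEY"] = 'XXX'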
client = OpenAI()


# Define the user message (chat-style format)
messages = [
    {"role": "user", "content": f"""
Extract the following structured information from the text below:
- Name
- Current Role
- Location
- Previous Roles
- Education

Text: {linkedin_text}

Output the result as a JSON object.
"""}
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages
)
output = response.choices[0].message.content
print(output)


```json
{
  "Name": "Hugo Bowne-Anderson",
  "Current Role": "Independent Data and AI Scientist, Consultant, Writer, Educator, Podcaster",
  "Location": "Darlinghurst, New South Wales, Australia",
  "Previous Roles": [
    {
      "Title": "Head of Developer Relations",
      "Company": "Outerbounds",
      "Duration": "Feb 2022 - Aug 2024"
    },
    {
      "Title": "Head of Data Science Evangelism and Marketing",
      "Company": "Coiled",
      "Duration": "May 2020 - Oct 2021"
    },
    {
      "Title": "Data Scientist",
      "Company": "DataCamp",
      "Duration": "Sep 2017 - May 2020"
    },
    {
      "Title": "Curriculum Engineer (Python)",
      "Company": "DataCamp",
      "Duration": "Mar 2016 - May 2020"
    },
    {
      "Title": "Postdoctoral Associate/Writer",
      "Company": "Yale University",
      "Duration": "2013 - Mar 2016"
    }
  ],
  "Education": [
    {
      "Institution": "UNSW",
      "Degree": "Doctor of Philosophy (PhD)",
      "Field": "Pure Mathem

In [3]:
import json

def test_valid_json(output):
    try:
        parsed_output = json.loads(output)
        assert isinstance(parsed_output, dict), "Output is not a valid JSON object"
        print("Valid JSON!")
    except json.JSONDecodeError as e:
        print(f"Invalid JSON: {e}")
        raise

# Test the output
test_valid_json(output)

Invalid JSON: Expecting value: line 1 column 1 (char 0)


JSONDecodeError: Expecting value: line 1 column 1 (char 0)
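
The error at `char 0` is consistent with the printed output above, which begins with a Markdown `json` code fence rather than with `{`, so `json.loads` fails on the very first character. One possible fallback, shown here as a minimal sketch, is to strip a surrounding fence before parsing. `strip_code_fence` is a hypothetical helper that is not part of the original notebook, and it assumes a single fence wraps the whole response.

```python
import json
import re

# Three backticks, built programmatically so this example does not contain
# a literal fence inside its own code block.
FENCE = "`" * 3

def strip_code_fence(text: str) -> str:
    """Return text without a single surrounding Markdown code fence, if present."""
    pattern = rf"^\s*{FENCE}[a-zA-Z]*\s*\n(.*?)\n?\s*{FENCE}\s*$"
    match = re.match(pattern, text, flags=re.DOTALL)
    # Fall back to the stripped input when no fence wraps the text.
    return match.group(1) if match else text.strip()

# Hypothetical model response wrapped in a json fence
fenced = FENCE + 'json\n{"Name": "Example Person"}\n' + FENCE
parsed = json.loads(strip_code_fence(fenced))
print(parsed["Name"])
```

This kind of cleanup is brittle because it depends on how the model happens to format its reply, which is one reason to prefer an API-level guarantee of valid JSON.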

## Step 2: Switching to OpenAI JSON Mode

OpenAI's JSON mode constrains the model to return syntactically valid JSON, which improves reliability and reduces the need for custom parsing logic. It does not guarantee that the required fields are present, so the structure still has to be checked. In this step we:
- Re-run the API call with `response_format={"type": "json_object"}`.
- Validate that the output parses as JSON and has the required structure.

Below is the implementation of JSON mode with a single test profile.

In [4]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={ "type": "json_object"},
    messages=messages
)
output = response.choices[0].message.content
print(output)

{
  "Name": "Hugo Bowne-Anderson",
  "Current Role": "Independent Data and AI Scientist",
  "Location": "Darlinghurst, New South Wales, Australia",
  "Previous Roles": [
    {
      "Title": "Head of Developer Relations",
      "Company": "Outerbounds",
      "Duration": "Feb 2022 - Aug 2024"
    },
    {
      "Title": "Head of Marketing and Data Science Evangelism",
      "Company": "Coiled",
      "Duration": "May 2020 - Oct 2021"
    },
    {
      "Title": "Data Scientist",
      "Company": "DataCamp",
      "Duration": "Sep 2017 - May 2020"
    },
    {
      "Title": "Curriculum Engineer (Python)",
      "Company": "DataCamp",
      "Duration": "Mar 2016 - May 2020"
    },
    {
      "Title": "Postdoctoral Associate/Writer",
      "Company": "Yale University",
      "Duration": "2013 - Mar 2016"
    }
  ],
  "Education": [
    {
      "Degree": "Doctor of Philosophy (PhD)",
      "Field": "Pure Mathematics",
      "Institution": "UNSW",
      "Duration": "2006 - 2011"
    },
  

In [5]:
# Test the output
test_valid_json(output)

Valid JSON!


In [6]:
def test_structure(parsed_output):
    required_fields = ["Name", "Current Role", "Location", "Previous Roles", "Education"]
    assert all(field in parsed_output for field in required_fields), "Missing required fields"
    assert isinstance(parsed_output["Previous Roles"], list), "Previous Roles should be a list"
    assert isinstance(parsed_output["Education"], list), "Education should be a list"
    print("Structure test passed!")

# Run the structure test
parsed_output = json.loads(output)
print(parsed_output)
test_structure(parsed_output)

{'Name': 'Hugo Bowne-Anderson', 'Current Role': 'Independent Data and AI Scientist', 'Location': 'Darlinghurst, New South Wales, Australia', 'Previous Roles': [{'Title': 'Head of Developer Relations', 'Company': 'Outerbounds', 'Duration': 'Feb 2022 - Aug 2024'}, {'Title': 'Head of Marketing and Data Science Evangelism', 'Company': 'Coiled', 'Duration': 'May 2020 - Oct 2021'}, {'Title': 'Data Scientist', 'Company': 'DataCamp', 'Duration': 'Sep 2017 - May 2020'}, {'Title': 'Curriculum Engineer (Python)', 'Company': 'DataCamp', 'Duration': 'Mar 2016 - May 2020'}, {'Title': 'Postdoctoral Associate/Writer', 'Company': 'Yale University', 'Duration': '2013 - Mar 2016'}], 'Education': [{'Degree': 'Doctor of Philosophy (PhD)', 'Field': 'Pure Mathematics', 'Institution': 'UNSW', 'Duration': '2006 - 2011'}, {'Degree': 'Bachelor of Science (B.S.) with First Class Honors', 'Fields': 'Mathematics, English Literature', 'Institution': 'University of Sydney', 'Duration': '2001 - 2005'}]}
Structure test

## Step 3: Batch Processing LinkedIn Profiles

After validating JSON mode with one profile, we extend the workflow to process multiple profiles. Text files containing LinkedIn profile data are stored in a `data/` directory and processed in batch. Key steps include:
- Reading each text file as input.
- Extracting structured data using OpenAI's API.
- Validating the JSON output for each profile.

In [7]:
# Directory where LinkedIn profile files are stored
profile_dir = "data"

# Function to validate JSON
def validate_json(raw_output):
    try:
        parsed_output = json.loads(raw_output)
        return parsed_output, True
    except json.JSONDecodeError as e:
        print(f"Invalid JSON: {e}")
        return None, False

# Function to process profiles
def process_profiles(directory):
    profiles = []
    for file in os.listdir(directory):
        if file.endswith(".txt"):
            with open(os.path.join(directory, file), "r", encoding="utf-8") as f:
                linkedin_text = f.read()
                print(f"Processing: {file}")
                
                # Define the prompt for the LLM
                messages = [
                    {"role": "user", "content": f"""
                    Extract the following structured information from the text below:
                    - Name
                    - Current Role
                    - Location
                    - Previous Roles
                    - Education

                    Text: {linkedin_text}

                    Output the result as a JSON object.
                    """}
                ]

                # Make LLM API call
                try:
                    response = client.chat.completions.create(
                        model="gpt-4o-mini",
                        response_format={ "type": "json_object"},
                        messages=messages
                    )
                    raw_output = response.choices[0].message.content

                    # Validate JSON
                    parsed_output, is_valid = validate_json(raw_output)
                    if is_valid:
                        # Keep the raw text so later checks can inspect it
                        profiles.append({
                            "file_name": file,
                            "parsed_output": parsed_output,
                            "raw_text": linkedin_text,
                        })
                    else:
                        print(f"Skipping invalid JSON for file: {file}")
                except Exception as e:
                    print(f"Error processing file {file}: {e}")
    return profiles

In [8]:
profiles = process_profiles("data")

Processing: hamelLI.txt
Processing: dagworks.txt
Processing: hbaLI.txt
Processing: shreyaLI.txt
Processing: stefanLI.txt
Processing: chipLI.txt


In [9]:
import json
import os

# Same helper as in the batch-processing cell, repeated so this cell runs on its own
def validate_json(json_string):
    try:
        parsed = json.loads(json_string)
        return parsed, True
    except json.JSONDecodeError:
        return None, False

# Function to process a single profile multiple times
def process_single_profile(file_path, iterations=20, print_after_iteration=True):
    outputs = []
    
    with open(file_path, "r", encoding="utf-8") as f:
        linkedin_text = f.read()
        print(f"Processing profile: {file_path}")
        
        for i in range(iterations):
            print(f"\nIteration: {i + 1}")
            
            # Define the prompt for the LLM
            messages = [
                {"role": "user", "content": f"""
                Extract the following structured information from the text below:
                - Name
                - Current Role
                - Location
                - Previous Roles
                - Education

                Text: {linkedin_text}

                Output the result as a JSON object.
                """}
            ]

            # Make LLM API call
            try:
                response = client.chat.completions.create(
                    model="gpt-4o-mini",
                    response_format={"type": "json_object"},
                    messages=messages
                )
                raw_output = response.choices[0].message.content
                
                # Validate JSON
                parsed_output, is_valid = validate_json(raw_output)
                if is_valid:
                    outputs.append({"iteration": i + 1, "parsed_output": parsed_output})
                    if print_after_iteration:
                        print(f"Output for Iteration {i + 1}: {parsed_output}")
                else:
                    print(f"Invalid JSON output at iteration {i + 1}. Skipping.")
            except Exception as e:
                print(f"Error on iteration {i + 1}: {e}")

    return outputs

# Example usage
file_path='data/stefanLI.txt'
process_single_profile(file_path)

Processing profile: data/stefanLI.txt

Iteration: 1
Output for Iteration 1: {'Name': 'Stefan Krawczyk', 'Current Role': 'CEO @ DAGWorks Inc.', 'Location': 'San Francisco, California, United States', 'Previous Roles': ['Co-creator of Hamilton & Burr', 'Pipelines & Agents: Data, Data Science, Machine Learning, & LLMs'], 'Education': []}

Iteration: 2
Output for Iteration 2: {'Name': 'Stefan Krawczyk', 'Current Role': 'CEO @ DAGWorks Inc.', 'Location': 'San Francisco, California, United States', 'Previous Roles': ['Co-creator of Hamilton & Burr', 'Pipelines & Agents: Data, Data Science, Machine Learning, & LLMs'], 'Education': []}

Iteration: 3
Output for Iteration 3: {'Name': 'Stefan Krawczyk', 'Current Role': 'CEO @ DAGWorks Inc.', 'Location': 'San Francisco, California, United States', 'Previous Roles': ['Co-creator of Hamilton & Burr', 'Pipelines & Agents: Data, Data Science, Machine Learning, & LLMs'], 'Education': None}

Iteration: 4
Output for Iteration 4: {'Name': 'Stefan Krawczyk

[{'iteration': 1,
  'parsed_output': {'Name': 'Stefan Krawczyk',
   'Current Role': 'CEO @ DAGWorks Inc.',
   'Location': 'San Francisco, California, United States',
   'Previous Roles': ['Co-creator of Hamilton & Burr',
    'Pipelines & Agents: Data, Data Science, Machine Learning, & LLMs'],
   'Education': []}},
 {'iteration': 2,
  'parsed_output': {'Name': 'Stefan Krawczyk',
   'Current Role': 'CEO @ DAGWorks Inc.',
   'Location': 'San Francisco, California, United States',
   'Previous Roles': ['Co-creator of Hamilton & Burr',
    'Pipelines & Agents: Data, Data Science, Machine Learning, & LLMs'],
   'Education': []}},
 {'iteration': 3,
  'parsed_output': {'Name': 'Stefan Krawczyk',
   'Current Role': 'CEO @ DAGWorks Inc.',
   'Location': 'San Francisco, California, United States',
   'Previous Roles': ['Co-creator of Hamilton & Burr',
    'Pipelines & Agents: Data, Data Science, Machine Learning, & LLMs'],
   'Education': None}},
 {'iteration': 4,
  'parsed_output': {'Name': 'Ste
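
The runs above differ in small ways: `Education` comes back as `[]` in some iterations and `None` in another. A minimal sketch of how such variation could be summarized per field is shown below. `summarize_variation` is a hypothetical helper, and the example data is made up to mirror the shape returned by `process_single_profile`.

```python
import json
from collections import Counter

def summarize_variation(outputs, fields):
    """Count the distinct values each field takes across repeated runs."""
    summary = {}
    for field in fields:
        # Serialize with sorted keys so equal values compare equal as strings.
        values = [
            json.dumps(run["parsed_output"].get(field), sort_keys=True)
            for run in outputs
        ]
        summary[field] = Counter(values)
    return summary

# Hypothetical outputs shaped like the return value of process_single_profile
example_outputs = [
    {"iteration": 1, "parsed_output": {"Name": "A. Person", "Education": []}},
    {"iteration": 2, "parsed_output": {"Name": "A. Person", "Education": []}},
    {"iteration": 3, "parsed_output": {"Name": "A. Person", "Education": None}},
]

summary = summarize_variation(example_outputs, ["Name", "Education"])
for field, counts in summary.items():
    status = "stable" if len(counts) == 1 else "varies"
    print(f"{field}: {status} {dict(counts)}")
```

Fields that vary across identical inputs are natural candidates for explicit type checks or tighter prompt instructions.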

## Step 4: Validating JSON Output

Each JSON output from the LLM is validated to ensure:
- The JSON is syntactically valid (using `json.loads()`).
- All required fields are present (`Name`, `Current Role`, `Location`, `Previous Roles`, `Education`).
- The data types of fields match expectations:
  - `Previous Roles` and `Education` should be lists.
  - Other fields should be strings.

Profiles failing validation are flagged for review.

In [None]:
# Function to verify fields and structure
def verify_fields(json_obj):
    required_fields = ["Name", "Current Role", "Location", "Previous Roles", "Education"]
    
    # Check for required fields
    missing_fields = [field for field in required_fields if field not in json_obj]
    if missing_fields:
        print(f"❌ Missing fields: {missing_fields}")
        return False
    else:
        print("✅ All required fields are present")
    
    # Check data types for specific fields
    if not isinstance(json_obj["Previous Roles"], list):
        print("❌ 'Previous Roles' should be a list")
        return False
    if not isinstance(json_obj["Education"], list):
        print("❌ 'Education' should be a list")
        return False
    
    # All checks passed
    print("✅ Structure is valid")
    return True

# Apply field verification to profiles
valid_profiles = []
invalid_profiles = []

for profile in profiles:
    file_name = profile["file_name"]
    json_obj = profile["parsed_output"]
    print(f"📝 Verifying: {file_name}")
    
    if verify_fields(json_obj):
        valid_profiles.append(profile)
        print(f"✅ Profile from {file_name} is valid")
    else:
        print(f"❌ Invalid JSON structure in: {file_name}")
        invalid_profiles.append(profile)

# Log the results
print(f"✨ Valid profiles: {len(valid_profiles)}")
print(f"🚨 Invalid profiles: {len(invalid_profiles)}")

In [None]:
profiles


In [None]:
profiles[1]
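
The validation rules in this step say that the remaining fields should be strings, but `verify_fields` only checks the two list fields. A minimal sketch of a fuller type check follows. `find_type_problems` and the two field lists are hypothetical names introduced for illustration, and the example profile is made up.

```python
STRING_FIELDS = ["Name", "Current Role", "Location"]
LIST_FIELDS = ["Previous Roles", "Education"]

def find_type_problems(json_obj):
    """Return a list of human-readable type problems (empty if none)."""
    problems = []
    for field in STRING_FIELDS:
        if field in json_obj and not isinstance(json_obj[field], str):
            actual = type(json_obj[field]).__name__
            problems.append(f"'{field}' should be a string, got {actual}")
    for field in LIST_FIELDS:
        if field in json_obj and not isinstance(json_obj[field], list):
            actual = type(json_obj[field]).__name__
            problems.append(f"'{field}' should be a list, got {actual}")
    return problems

# Hypothetical parsed output with two type problems
example = {
    "Name": "A. Person",
    "Current Role": ["CEO"],
    "Location": "Sydney",
    "Previous Roles": [],
    "Education": None,
}
print(find_type_problems(example))
```

Returning a list of problems instead of a single boolean keeps every issue visible to the reviewer, not only the first one encountered.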

## Step 5: Adding Flags to Suspicious Profiles

Profiles with missing or unexpected fields are flagged for domain expert review. Examples of flags include:
- Missing fields like `Previous Roles` or `Education`.
- Incorrectly formatted fields.

Flags help identify potential issues in profile extraction and validation logic. The cell below computes the flags for each profile and prints them.

In [None]:
def flag_suspicious_profiles(profile):
    parsed_output = profile["parsed_output"]
    flags = []

    # Check for empty important fields
    if not parsed_output.get("Previous Roles"):
        flags.append("Empty 'Previous Roles'")
    if not parsed_output.get("Education"):
        flags.append("Empty 'Education'")
    
    # Heuristic for mismatch: Current Role but no Previous Roles/Education
    if parsed_output.get("Current Role") and not parsed_output.get("Previous Roles") and not parsed_output.get("Education"):
        flags.append("No career or educational history provided")
    
    # Additional checks (e.g., raw text clues)
    raw_text = profile.get("raw_text", "")
    if "mission" in raw_text.lower() or "headquarters" in raw_text.lower():
        flags.append("Profile might be a company, not a person")
    
    return flags

# Apply the flagging to profiles
for profile in profiles:
    file_name = profile["file_name"]
    flags = flag_suspicious_profiles(profile)
    if flags:
        print(f"🚩 Flags for {file_name}: {', '.join(flags)}")
    else:
        print(f"✅ No flags for {file_name}")

## Step 7: Saving Data to CSV

Profiles are saved to a CSV file for further analysis and review. The CSV includes:
- Source file name (`file_name`).
- Extracted fields (e.g., `Name`, `Current Role`).
- Flags for validation issues.

In [None]:
import os
import json
import csv


# Function to save JSON data to CSV
def save_json_to_csv(profiles, output_file="profiles.csv"):
    # Collect all unique keys from the JSON objects to form the CSV headers
    all_keys = set()
    for profile in profiles:
        parsed_output = profile["parsed_output"]
        all_keys.update(parsed_output.keys())

    # Convert the set of keys into a sorted list to ensure consistent ordering in the CSV
    headers = sorted(all_keys)
    headers.insert(0, "file_name")  # Add file_name as the first column
    headers.append("flags")  # Add flags as the last column

    with open(output_file, mode="w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=headers)
        writer.writeheader()

        for profile in profiles:
            # Start with the file_name and flags
            row = {"file_name": profile["file_name"], "flags": ", ".join(flag_suspicious_profiles(profile))}
            # Add the parsed JSON fields
            parsed_output = profile["parsed_output"]
            for key in headers:
                if key in parsed_output:
                    value = parsed_output[key]
                    # Convert lists or dictionaries into strings
                    if isinstance(value, list):
                        row[key] = "; ".join([str(item) for item in value])
                    elif isinstance(value, dict):
                        row[key] = str(value)
                    else:
                        row[key] = value
                elif key not in ["file_name", "flags"]:  # Avoid overwriting file_name or flags
                    row[key] = ""  # Leave empty if the field does not exist in the JSON
            writer.writerow(row)

    print(f"🚀 Profiles saved to {output_file}")


# Save to CSV
save_json_to_csv(profiles)
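
The function above flattens lists with `"; ".join(str(item) ...)`, which is easy to read but cannot be parsed back into the original structure. If the nested data needs to survive a round trip through the CSV, one alternative is to store nested values as JSON text. This is a minimal sketch under that assumption; `to_cell` is a hypothetical helper and the example profile is made up.

```python
import csv
import io
import json

def to_cell(value):
    """Serialize nested values as JSON text; leave scalars unchanged."""
    if isinstance(value, (list, dict)):
        return json.dumps(value, ensure_ascii=False)
    return "" if value is None else value

# Hypothetical parsed output
parsed_output = {
    "Name": "A. Person",
    "Previous Roles": [{"Title": "Engineer", "Company": "Example Co"}],
    "Education": None,
}

# Write one row to an in-memory CSV
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(parsed_output))
writer.writeheader()
writer.writerow({key: to_cell(value) for key, value in parsed_output.items()})

# Read the row back and recover the nested structure
buffer.seek(0)
row = next(csv.DictReader(buffer))
roles = json.loads(row["Previous Roles"])
print(roles[0]["Title"])
```

The trade-off is readability: JSON cells are harder to scan in a spreadsheet, so the joined-string form may still be preferable for manual review.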

## Step 8: Displaying Data as Editable Table

The CSV data is rendered as an interactive table where:
- Flags and extracted fields can be manually reviewed.
- Annotations (e.g., comments, notes) can be added for each profile.

Below is the editable table for reviewing the profiles.

In [None]:
!pip install dash pandas

In [None]:
from dash import Dash, dash_table, html, Input, Output, State
import pandas as pd

# Load your CSV into a DataFrame
csv_file_path = "profiles.csv"
df = pd.read_csv(csv_file_path)

# Create a Dash app
app = Dash(__name__)

# Layout of the app
app.layout = html.Div([
    html.H1("Editable Profiles Table"),
    dash_table.DataTable(
        id='editable-table',
        columns=[{"name": col, "id": col, "editable": True} for col in df.columns],
        data=df.to_dict('records'),
        editable=True,
        row_deletable=True,
        style_table={'overflowX': 'auto'},
        style_cell={'textAlign': 'left', 'padding': '5px'},
    ),
    html.Button("Save Changes", id='save-button', n_clicks=0),
    html.Div(id='save-output', style={"marginTop": "20px"})
])

# Callback to save changes back to CSV.
# The table data is passed as State so the callback fires only when the
# save button is clicked, not on every cell edit.
@app.callback(
    Output('save-output', 'children'),
    Input('save-button', 'n_clicks'),
    State('editable-table', 'data'),
    prevent_initial_call=True,
)
def save_to_csv(n_clicks, rows):
    edited_df = pd.DataFrame(rows)
    edited_df.to_csv(csv_file_path, index=False)
    return f"✅ Changes saved to '{csv_file_path}'"

# Run the app (older Dash releases use app.run_server instead)
if __name__ == '__main__':
    app.run(debug=True)

## Step 9: Adding Unit Tests for Profile Validation with Pytest

At this point, we’ve validated and flagged the profiles, and it’s time to implement unit tests using pytest to automate the validation process and ensure the robustness of our workflow. These tests will check that:
- The JSON output is syntactically correct.
- The required fields are present.
- The data types of the fields are correct.

### Test Setup

To run these tests inside the notebook, install `ipytest`, which runs pytest against tests defined in notebook cells. pytest itself must also be installed.

### Writing Tests with Pytest

We’ll create a test file that includes the following tests:
- Test for valid JSON format: Ensures that the JSON output is correctly parsed.
- Test for required fields: Validates that the required fields (Name, Current Role, Location, Previous Roles, Education) are present in the profile.
- Test for data types: Ensures that fields like Previous Roles and Education are lists.
- Test for missing required field: Flags missing required fields.
- Test for invalid data type: Flags invalid data types for fields that should be lists.



In [None]:
!pip install ipytest

In [None]:
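import ipytest
import pytest
import sys

# Configure ipytest for notebook use (enables pytest's assertion rewriting)
ipytest.autoconfig()

# Remove test_* functions defined in earlier cells (e.g. test_valid_json,
# test_structure) so pytest does not try to collect them as tests
test_funcs = [name for name in dir(sys.modules[__name__]) if name.startswith('test_')]
for func in test_funcs:
    delattr(sys.modules[__name__], func)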

def test_profiles_structure():
    """Test that all profiles have the required keys"""
    required_fields = ["Name", "Current Role", "Location", "Previous Roles", "Education"]
    
    for idx, profile in enumerate(profiles):
        # Check top level structure
        assert isinstance(profile, dict), f"Profile {idx} should be a dictionary"
        assert "file_name" in profile, f"Profile {idx} missing file_name"
        assert "parsed_output" in profile, f"Profile {idx} missing parsed_output"
        
        # Check parsed_output has all required fields
        parsed = profile["parsed_output"]
        for field in required_fields:
            assert field in parsed, f"Profile {idx} missing {field} in parsed_output"

ipytest.run('-v')
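
The list of planned tests also mentions missing required fields and invalid data types, which the cell above does not cover. A minimal sketch of such tests is shown below, written against made-up profiles rather than live LLM output so the results are deterministic. `validation_errors` and `make_profile` are hypothetical helpers introduced for illustration.

```python
REQUIRED_FIELDS = ["Name", "Current Role", "Location", "Previous Roles", "Education"]
LIST_FIELDS = ["Previous Roles", "Education"]

def validation_errors(parsed):
    """Return a list of problems found in one parsed profile (empty if valid)."""
    errors = [f"missing {field}" for field in REQUIRED_FIELDS if field not in parsed]
    errors += [
        f"{field} is not a list"
        for field in LIST_FIELDS
        if field in parsed and not isinstance(parsed[field], list)
    ]
    return errors

def make_profile(**overrides):
    """Build a valid made-up profile, optionally overriding some fields."""
    profile = {
        "Name": "A. Person",
        "Current Role": "Engineer",
        "Location": "Sydney",
        "Previous Roles": [],
        "Education": [],
    }
    profile.update(overrides)
    return profile

def test_valid_profile_has_no_errors():
    assert validation_errors(make_profile()) == []

def test_missing_required_field_is_reported():
    profile = make_profile()
    del profile["Education"]
    assert validation_errors(profile) == ["missing Education"]

def test_invalid_list_type_is_reported():
    # Mirrors the 'Education': None output seen in one of the repeated runs
    assert validation_errors(make_profile(Education=None)) == ["Education is not a list"]
```

Because these functions follow the `test_*` naming convention, pytest (and therefore `ipytest.run`) would collect them automatically when they are defined in a notebook cell.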
