# LinkedIn Profile Extraction and Validation Workflow

This notebook walks through a workflow for extracting structured data from LinkedIn profiles using an LLM, validating the outputs, and preparing the data for domain expert review. The key steps include:
1. Extracting structured JSON data from unstructured LinkedIn text.
2. Validating the JSON output.
3. Flagging suspicious profiles based on missing or invalid fields.
4. Saving the data to a CSV file.
5. Reviewing and annotating the profiles in an interactive table.

This process can be applied to LinkedIn profiles for individuals or companies to assist with data validation, annotation, and review.

## Step 1: Testing the Pipeline with One Profile

Before applying the process to all profiles, we start by testing the pipeline with a single LinkedIn profile. This ensures that:
- The LLM generates structured JSON data.
- The JSON output is valid and complete.

We'll use one text file (`<profile_name>.txt`) as input and observe the output. 

In [1]:
# Read the file content into a string
file_path = "data/hbaLI.txt"  # Replace with your actual file path

# Open the file in read mode and read the content
with open(file_path, "r", encoding="utf-8") as file:
    linkedin_text = file.read()

# Print the content to verify
# print(linkedin_text)

In [None]:
import os
from openai import OpenAI


# Placeholder only: replace 'XXX' with your key, or set OPENAI_API_KEY outside the notebook
os.environ["OPENAI_API_KEY"] = 'XXX'
client = OpenAI()


# Define the user message (chat-style format)
messages = [
    {"role": "user", "content": f"""
Extract the following structured information from the text below:
- Name
- Current Role
- Location
- Previous Roles
- Education

Text: {linkedin_text}

Output the result as a JSON object.
"""}
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages
)
output = response.choices[0].message.content
print(output)


```json
{
  "Name": "Hugo Bowne-Anderson",
  "Current Role": "Independent Data and AI Scientist, Consultant, Writer, Educator, Podcaster",
  "Location": "Darlinghurst, New South Wales, Australia",
  "Previous Roles": [
    {
      "Role": "Head of Developer Relations",
      "Company": "Outerbounds",
      "Duration": "Feb 2022 - Aug 2024"
    },
    {
      "Role": "Head of Data Science Evangelism and Marketing",
      "Company": "Coiled",
      "Duration": "May 2020 - Oct 2021"
    },
    {
      "Role": "Data Scientist",
      "Company": "DataCamp",
      "Duration": "Sep 2017 - May 2020"
    },
    {
      "Role": "Curriculum Engineer (Python)",
      "Company": "DataCamp",
      "Duration": "Mar 2016 - May 2020"
    },
    {
      "Role": "Postdoctoral Associate/Writer",
      "Company": "Yale University",
      "Duration": "2013 - Mar 2016"
    }
  ],
  "Education": [
    {
      "Institution": "UNSW",
      "Degree": "Doctor of Philosophy (PhD)",
      "Field": "Pure Mathematics

In [3]:
import json

def test_valid_json(output):
    try:
        parsed_output = json.loads(output)
        assert isinstance(parsed_output, dict), "Output is not a valid JSON object"
        print("Valid JSON!")
    except json.JSONDecodeError as e:
        print(f"Invalid JSON: {e}")
        raise

# Test the output
test_valid_json(output)

Invalid JSON: Expecting value: line 1 column 1 (char 0)


JSONDecodeError: Expecting value: line 1 column 1 (char 0)
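
The parse fails at the very first character because the model wrapped its answer in a Markdown code fence labelled `json` (visible at the top of the output above), and `json.loads` expects the text to begin with a JSON value. One possible workaround, sketched below, is to strip such a fence before parsing. The helper name `strip_code_fence` is hypothetical and not part of the original notebook, which takes a different route in Step 2.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example easy to read

def strip_code_fence(text):
    """Return text without a surrounding Markdown code fence, if one is present."""
    # Hypothetical helper: matches an optional language tag after the opening fence
    pattern = r"^\s*" + FENCE + r"[a-zA-Z]*\s*\n(.*?)\n?\s*" + FENCE + r"\s*$"
    match = re.match(pattern, text, re.DOTALL)
    return match.group(1) if match else text.strip()

# Made-up input that mimics the fenced answer seen above
fenced = FENCE + "json\n" + '{"Name": "Example Person"}' + "\n" + FENCE
parsed = json.loads(strip_code_fence(fenced))
print(parsed["Name"])
```

Text that carries no fence is returned unchanged apart from surrounding whitespace, so the helper can be applied to every response.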

## Step 2: Switching to OpenAI JSON Mode

OpenAI's JSON mode constrains the model to return syntactically valid JSON, which removes the parsing failure seen above. It does not enforce a particular schema, so field-level validation is still needed (see Step 4). We:
- Re-run the API call with `response_format={"type": "json_object"}`.
- Validate that the output parses and contains the required fields.

Below is the implementation of JSON mode with a single test profile.

In [4]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={ "type": "json_object"},
    messages=messages
)
output = response.choices[0].message.content
print(output)

{
  "Name": "Hugo Bowne-Anderson",
  "Current Role": "Independent Data and AI Scientist, Consultant, Writer, Educator, Podcaster",
  "Location": "Darlinghurst, New South Wales, Australia",
  "Previous Roles": [
    {
      "Title": "Head of Developer Relations",
      "Company": "Outerbounds",
      "Duration": "Feb 2022 - Aug 2024"
    },
    {
      "Title": "Head of Data Science Evangelism and Marketing",
      "Company": "Coiled",
      "Duration": "May 2020 - Oct 2021"
    },
    {
      "Title": "Data Scientist",
      "Company": "DataCamp",
      "Duration": "Sep 2017 - May 2020"
    },
    {
      "Title": "Curriculum Engineer (Python)",
      "Company": "DataCamp",
      "Duration": "Mar 2016 - May 2020"
    },
    {
      "Title": "Postdoctoral Associate/Writer",
      "Company": "Yale University",
      "Duration": "2013 - Mar 2016"
    }
  ],
  "Education": [
    {
      "Degree": "Doctor of Philosophy (PhD)",
      "Field": "Pure Mathematics",
      "Institution": "UNSW",


In [5]:
# Test the output
test_valid_json(output)

Valid JSON!


In [6]:
def test_structure(parsed_output):
    required_fields = ["Name", "Current Role", "Location", "Previous Roles", "Education"]
    assert all(field in parsed_output for field in required_fields), "Missing required fields"
    assert isinstance(parsed_output["Previous Roles"], list), "Previous Roles should be a list"
    assert isinstance(parsed_output["Education"], list), "Education should be a list"
    print("Structure test passed!")

# Run the structure test
parsed_output = json.loads(output)
print(parsed_output)
test_structure(parsed_output)

{'Name': 'Hugo Bowne-Anderson', 'Current Role': 'Independent Data and AI Scientist, Consultant, Writer, Educator, Podcaster', 'Location': 'Darlinghurst, New South Wales, Australia', 'Previous Roles': [{'Title': 'Head of Developer Relations', 'Company': 'Outerbounds', 'Duration': 'Feb 2022 - Aug 2024'}, {'Title': 'Head of Data Science Evangelism and Marketing', 'Company': 'Coiled', 'Duration': 'May 2020 - Oct 2021'}, {'Title': 'Data Scientist', 'Company': 'DataCamp', 'Duration': 'Sep 2017 - May 2020'}, {'Title': 'Curriculum Engineer (Python)', 'Company': 'DataCamp', 'Duration': 'Mar 2016 - May 2020'}, {'Title': 'Postdoctoral Associate/Writer', 'Company': 'Yale University', 'Duration': '2013 - Mar 2016'}], 'Education': [{'Degree': 'Doctor of Philosophy (PhD)', 'Field': 'Pure Mathematics', 'Institution': 'UNSW', 'Duration': '2006 - 2011'}, {'Degree': 'Bachelor of Science (B.S.) (First Class Honors)', 'Field': 'Mathematics, English Literature', 'Institution': 'University of Sydney', 'Durat
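
Although both runs produced the same top-level fields, the key names inside nested objects drifted: the first run used `Role` for each previous position, while the JSON-mode run used `Title`. A minimal sketch of normalizing such variants to one key follows. The helper `normalize_roles` and the choice of `Title` as the canonical key are assumptions for illustration, not part of the original workflow.

```python
def normalize_roles(parsed_output, canonical_key="Title", variants=("Role",)):
    """Return a copy of parsed_output whose 'Previous Roles' entries use one key for the job title."""
    normalized = dict(parsed_output)
    roles = parsed_output.get("Previous Roles")
    if not isinstance(roles, list):
        return normalized  # nothing to normalize; structure checks happen elsewhere

    cleaned = []
    for entry in roles:
        if not isinstance(entry, dict):
            cleaned.append(entry)  # leave unexpected shapes untouched for later review
            continue
        entry = dict(entry)  # copy so the caller's data is not mutated
        for variant in variants:
            if variant in entry and canonical_key not in entry:
                entry[canonical_key] = entry.pop(variant)
        cleaned.append(entry)
    normalized["Previous Roles"] = cleaned
    return normalized

# Made-up example data
example = {"Name": "Example Person",
           "Previous Roles": [{"Role": "Data Scientist", "Company": "ExampleCo"}]}
print(normalize_roles(example)["Previous Roles"])
```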

## Step 3: Batch Processing LinkedIn Profiles

After validating JSON mode with one profile, we extend the workflow to process multiple profiles. Text files containing LinkedIn profile data are stored in a `data/` directory and processed in batch. Key steps include:
- Reading each text file as input.
- Extracting structured data using OpenAI's API.
- Validating the JSON output for each profile.
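
The same prompt is used for the single-profile test and for the batch run, so one option is to build it in a single place. The sketch below assumes a hypothetical helper `build_messages` (not in the original notebook). Joining lines explicitly also keeps code indentation out of the prompt text, which a triple-quoted string nested inside a loop would otherwise carry along.

```python
FIELDS = ["Name", "Current Role", "Location", "Previous Roles", "Education"]

def build_messages(linkedin_text, fields=FIELDS):
    """Build the chat messages for one profile (hypothetical helper)."""
    prompt = "\n".join([
        "Extract the following structured information from the text below:",
        *[f"- {field}" for field in fields],
        "",
        f"Text: {linkedin_text}",
        "",
        "Output the result as a JSON object.",
    ])
    return [{"role": "user", "content": prompt}]

# Made-up profile text
messages_demo = build_messages("Jane Doe\nEngineer at ExampleCo")
print(messages_demo[0]["content"])
```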

In [7]:
# Directory where LinkedIn profile files are stored
profile_dir = "data"

# Function to validate JSON
def validate_json(raw_output):
    try:
        parsed_output = json.loads(raw_output)
        return parsed_output, True
    except json.JSONDecodeError as e:
        print(f"Invalid JSON: {e}")
        return None, False

# Function to process profiles
def process_profiles(directory):
    profiles = []
    for file in os.listdir(directory):
        if file.endswith(".txt"):
            with open(os.path.join(directory, file), "r", encoding="utf-8") as f:
                linkedin_text = f.read()
                print(f"Processing: {file}")
                
                # Define the prompt for the LLM
                messages = [
                    {"role": "user", "content": f"""
                    Extract the following structured information from the text below:
                    - Name
                    - Current Role
                    - Location
                    - Previous Roles
                    - Education

                    Text: {linkedin_text}

                    Output the result as a JSON object.
                    """}
                ]

                # Make LLM API call
                try:
                    response = client.chat.completions.create(
                        model="gpt-4o-mini",
                        response_format={ "type": "json_object"},
                        messages=messages
                    )
                    raw_output = response.choices[0].message.content

                    # Validate JSON
                    parsed_output, is_valid = validate_json(raw_output)
                    if is_valid:
                        profiles.append({"file_name": file, "parsed_output": parsed_output})
                    else:
                        print(f"Skipping invalid JSON for file: {file}")
                except Exception as e:
                    print(f"Error processing file {file}: {e}")
    return profiles

In [8]:
profiles = process_profiles(profile_dir)

Processing: hamelLI.txt
Processing: dagworks.txt
Processing: hbaLI.txt
Processing: shreyaLI.txt
Processing: stefanLI.txt
Processing: chipLI.txt


## Step 4: Validating JSON Output

Each JSON output from the LLM is validated to ensure:
- The JSON is syntactically valid (using `json.loads()`).
- All required fields are present (`Name`, `Current Role`, `Location`, `Previous Roles`, `Education`).
- The data types of fields match expectations:
  - `Previous Roles` and `Education` should be lists.
  - Other fields should be strings (this is not checked by `verify_fields` below).

Profiles failing validation are collected in `invalid_profiles` for review.
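
The cell below checks only for the required fields and for the two list-typed fields. A check for the string-typed fields could be sketched as follows. `check_string_fields` is a hypothetical helper, and treating blank strings as problems is an assumption rather than a rule stated in the original.

```python
STRING_FIELDS = ["Name", "Current Role", "Location"]

def check_string_fields(json_obj, string_fields=STRING_FIELDS):
    """Return the names of fields that are missing, not strings, or blank."""
    problems = []
    for field in string_fields:
        value = json_obj.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(field)
    return problems

# Made-up example: a list where a string is expected, and a blank location
print(check_string_fields({"Name": "Example Person",
                           "Current Role": ["Engineer"],
                           "Location": ""}))
# → ['Current Role', 'Location']
```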

In [9]:
# Function to verify fields and structure
def verify_fields(json_obj):
    required_fields = ["Name", "Current Role", "Location", "Previous Roles", "Education"]
    
    # Check for required fields
    missing_fields = [field for field in required_fields if field not in json_obj]
    if missing_fields:
        print(f"❌ Missing fields: {missing_fields}")
        return False
    else:
        print(f"✅ All required fields are present")
    
    # Check data types for specific fields
    if not isinstance(json_obj["Previous Roles"], list):
        print("❌ 'Previous Roles' should be a list")
        return False
    if not isinstance(json_obj["Education"], list):
        print("❌ 'Education' should be a list")
        return False
    
    # All checks passed
    print("✅ Structure is valid")
    return True

# Apply field verification to profiles
valid_profiles = []
invalid_profiles = []

for profile in profiles:
    file_name = profile["file_name"]
    json_obj = profile["parsed_output"]
    print(f"📝 Verifying: {file_name}")
    
    if verify_fields(json_obj):
        valid_profiles.append(profile)
        print(f"✅ Profile from {file_name} is valid")
    else:
        print(f"❌ Invalid JSON structure in: {file_name}")
        invalid_profiles.append(profile)

# Log the results
print(f"✨ Valid profiles: {len(valid_profiles)}")
print(f"🚨 Invalid profiles: {len(invalid_profiles)}")

📝 Verifying: hamelLI.txt
✅ All required fields are present
✅ Structure is valid
✅ Profile from hamelLI.txt is valid
📝 Verifying: dagworks.txt
✅ All required fields are present
✅ Structure is valid
✅ Profile from dagworks.txt is valid
📝 Verifying: hbaLI.txt
✅ All required fields are present
✅ Structure is valid
✅ Profile from hbaLI.txt is valid
📝 Verifying: shreyaLI.txt
✅ All required fields are present
✅ Structure is valid
✅ Profile from shreyaLI.txt is valid
📝 Verifying: stefanLI.txt
✅ All required fields are present
✅ Structure is valid
✅ Profile from stefanLI.txt is valid
📝 Verifying: chipLI.txt
✅ All required fields are present
✅ Structure is valid
✅ Profile from chipLI.txt is valid
✨ Valid profiles: 6
🚨 Invalid profiles: 0


In [10]:
profiles


[{'file_name': 'hamelLI.txt',
  'parsed_output': {'Name': 'Hamel Husain',
   'Current Role': 'Founder @ Parlance Labs | Entrepreneur in Residence, ML & Data Science',
   'Location': 'United States',
   'Previous Roles': [{'Role': 'Scout',
     'Company': 'Bain Capital',
     'Duration': 'Aug 2024 - Present'},
    {'Role': 'Entrepreneur in Residence',
     'Company': 'fast.ai',
     'Duration': 'Aug 2022 - May 2023'},
    {'Role': 'Core Contributor & Maintainer',
     'Company': 'fast.ai',
     'Duration': 'Sep 2019 - Aug 2022'},
    {'Role': 'Head of ML & Data Science',
     'Company': 'Outerbounds',
     'Duration': 'Jan 2022 - Aug 2022'},
    {'Role': 'Staff Machine Learning Engineer',
     'Company': 'GitHub',
     'Duration': 'Oct 2017 - Jan 2022'}],
   'Education': [{'Degree': 'Master of Science (M.S.)',
     'Field': 'Computer Science, Machine Learning',
     'Institution': 'Georgia Institute of Technology'},
    {'Degree': 'Doctor of Law (J.D.), Cum Laude',
     'Institution': '

In [18]:
profiles[1]

{'file_name': 'dagworks.txt',
 'parsed_output': {'Name': 'Thierry Jean',
  'Current Role': 'Founding Engineer',
  'Location': 'San Francisco, California',
  'Previous Roles': [],
  'Education': []}}

## Step 5: Adding Flags to Suspicious Profiles

Profiles that pass structural validation can still look suspicious, and these are flagged for domain expert review. Examples of flags include:
- Empty fields such as `Previous Roles` or `Education`.
- A current role with no career or educational history at all.
- Raw-text clues suggesting the profile describes a company rather than a person.

Flags help surface potential issues in the extraction and validation logic. The cell below applies them to every profile.

In [16]:
def flag_suspicious_profiles(profile):
    parsed_output = profile["parsed_output"]
    flags = []

    # Check for empty important fields
    if not parsed_output.get("Previous Roles"):
        flags.append("Empty 'Previous Roles'")
    if not parsed_output.get("Education"):
        flags.append("Empty 'Education'")
    
    # Heuristic for mismatch: Current Role but no Previous Roles/Education
    if parsed_output.get("Current Role") and not parsed_output.get("Previous Roles") and not parsed_output.get("Education"):
        flags.append("No career or educational history provided")
    
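    # Additional checks (e.g., raw text clues).
    # Note: process_profiles() does not store "raw_text", so this check is
    # skipped unless the profile dict has been given that key.
    raw_text = profile.get("raw_text", "")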
    if "mission" in raw_text.lower() or "headquarters" in raw_text.lower():
        flags.append("Profile might be a company, not a person")
    
    return flags

# Apply the flagging to profiles
for profile in profiles:
    file_name = profile["file_name"]
    flags = flag_suspicious_profiles(profile)
    if flags:
        print(f"🚩 Flags for {file_name}: {', '.join(flags)}")
    else:
        print(f"✅ No flags for {file_name}")

✅ No flags for hamelLI.txt
🚩 Flags for dagworks.txt: Empty 'Previous Roles', Empty 'Education', No career or educational history provided
✅ No flags for hbaLI.txt
✅ No flags for shreyaLI.txt
🚩 Flags for stefanLI.txt: Empty 'Previous Roles', Empty 'Education', No career or educational history provided
✅ No flags for chipLI.txt
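
None of the flagged profiles received the "might be a company" flag, which is expected: the dictionaries built in Step 3 hold only `file_name` and `parsed_output`, so the raw-text heuristic has nothing to inspect. One way to enable it, sketched under the assumption that the source files are still in the same directory, is to attach the text before flagging. `attach_raw_text` is a hypothetical helper, not part of the original notebook.

```python
import os
import tempfile

def attach_raw_text(profiles, directory):
    """Return copies of the profile dicts with the source text added under 'raw_text'."""
    enriched = []
    for profile in profiles:
        path = os.path.join(directory, profile["file_name"])
        with open(path, "r", encoding="utf-8") as f:
            enriched.append({**profile, "raw_text": f.read()})
    return enriched

# Demo on a throwaway directory with a made-up file
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "example.txt"), "w", encoding="utf-8") as f:
        f.write("Our mission is to build example software.")
    demo = attach_raw_text([{"file_name": "example.txt", "parsed_output": {}}], tmp)

print("mission" in demo[0]["raw_text"].lower())
```

In the notebook this would be called as `profiles = attach_raw_text(profiles, "data")` before re-running the flagging loop.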


## Step 7: Saving Data to CSV

Profiles are saved to a CSV file for further analysis and review. The CSV includes:
- Source file name (`file_name`).
- Extracted fields (e.g., `Name`, `Current Role`).
- Flags for validation issues.

In [19]:
import os
import json
import csv


# Function to save JSON data to CSV
def save_json_to_csv(profiles, output_file="profiles.csv"):
    # Collect all unique keys from the JSON objects to form the CSV headers
    all_keys = set()
    for profile in profiles:
        parsed_output = profile["parsed_output"]
        all_keys.update(parsed_output.keys())

    # Sort the keys to ensure consistent column ordering in the CSV
    headers = sorted(all_keys)
    headers.insert(0, "file_name")  # Add file_name as the first column
    headers.append("flags")  # Add flags as the last column

    with open(output_file, mode="w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=headers)
        writer.writeheader()

        for profile in profiles:
            # Start with the file_name and flags
            row = {"file_name": profile["file_name"], "flags": ", ".join(flag_suspicious_profiles(profile))}
            # Add the parsed JSON fields
            parsed_output = profile["parsed_output"]
            for key in headers:
                if key in parsed_output:
                    value = parsed_output[key]
                    # Convert lists or dictionaries into strings
                    if isinstance(value, list):
                        row[key] = "; ".join([str(item) for item in value])
                    elif isinstance(value, dict):
                        row[key] = str(value)
                    else:
                        row[key] = value
                elif key not in ["file_name", "flags"]:  # Avoid overwriting file_name or flags
                    row[key] = ""  # Leave empty if the field does not exist in the JSON
            writer.writerow(row)

    print(f"🚀 Profiles saved to {output_file}")


# Save to CSV
save_json_to_csv(profiles)

🚀 Profiles saved to profiles.csv
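
The function above writes list-valued fields by joining `str(item)` for each element, so nested dictionaries end up as Python `repr` text with single quotes, which `json.loads` cannot parse back. If the nested data needs to survive a round trip through the CSV, one alternative is to store those cells as JSON strings. The helper names below are hypothetical.

```python
import json

def to_cell(value):
    """Serialize lists and dicts as JSON text; pass scalars through unchanged."""
    if isinstance(value, (list, dict)):
        return json.dumps(value, ensure_ascii=False)
    return value

def from_cell(cell):
    """Parse a cell back into a list or dict when it holds JSON; otherwise return it as-is."""
    if isinstance(cell, str) and cell[:1] in ("[", "{"):
        try:
            return json.loads(cell)
        except json.JSONDecodeError:
            return cell
    return cell

# Made-up example data
roles = [{"Title": "Data Scientist", "Company": "ExampleCo"}]
cell = to_cell(roles)
print(cell)
print(from_cell(cell) == roles)
```

The trade-off is readability: JSON cells are less pleasant to scan in a spreadsheet than the `; `-joined form, so the choice depends on whether reviewers or downstream code are the main consumers.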


## Step 8: Displaying Data as Editable Table

The CSV data is rendered as an interactive table where:
- Flags and extracted fields can be manually reviewed.
- Annotations (e.g., comments, notes) can be added for each profile.

Below is the editable table for reviewing the profiles.

In [20]:
! pip install dash dash-table pandas

Collecting dash
  Downloading dash-2.18.2-py3-none-any.whl.metadata (10 kB)
Collecting dash-table
  Downloading dash_table-5.0.0-py3-none-any.whl.metadata (2.4 kB)
Collecting Flask<3.1,>=1.0.4 (from dash)
  Downloading flask-3.0.3-py3-none-any.whl.metadata (3.2 kB)
Collecting Werkzeug<3.1 (from dash)
  Downloading werkzeug-3.0.6-py3-none-any.whl.metadata (3.7 kB)
Collecting plotly>=5.0.0 (from dash)
  Downloading plotly-5.24.1-py3-none-any.whl.metadata (7.3 kB)
Collecting dash-html-components==2.0.0 (from dash)
  Downloading dash_html_components-2.0.0-py3-none-any.whl.metadata (3.8 kB)
Collecting dash-core-components==2.0.0 (from dash)
  Downloading dash_core_components-2.0.0-py3-none-any.whl.metadata (2.9 kB)
Collecting retrying (from dash)
  Downloading retrying-1.3.4-py3-none-any.whl.metadata (6.9 kB)
Collecting itsdangerous>=2.1.2 (from Flask<3.1,>=1.0.4->dash)
  Downloading itsdangerous-2.2.0-py3-none-any.whl.metadata (1.9 kB)
Collecting blinker>=1.6.2 (from Flask<3.1,>=1.0.4->das

In [22]:
import dash
from dash import Dash, dash_table, html, Input, Output, ctx
import pandas as pd

# Load your CSV into a DataFrame
csv_file_path = "profiles.csv"
df = pd.read_csv(csv_file_path)

# Create a Dash app
app = Dash(__name__)

# Layout of the app
app.layout = html.Div([
    html.H1("Editable Profiles Table"),
    dash_table.DataTable(
        id='editable-table',
        columns=[{"name": col, "id": col, "editable": True} for col in df.columns],
        data=df.to_dict('records'),
        editable=True,
        row_deletable=True,
        style_table={'overflowX': 'auto'},
        style_cell={'textAlign': 'left', 'padding': '5px'},
    ),
    html.Button("Save Changes", id='save-button', n_clicks=0),
    html.Div(id='save-output', style={"margin-top": "20px"})
])

# Callback to save changes back to CSV
@app.callback(
    Output('save-output', 'children'),
    Input('save-button', 'n_clicks'),
    Input('editable-table', 'data')
)
def save_to_csv(n_clicks, rows):
    if "save-button" == ctx.triggered_id:  # Ensure save button was clicked
        edited_df = pd.DataFrame(rows)
        edited_df.to_csv(csv_file_path, index=False)
        return f"✅ Changes saved to '{csv_file_path}'"
    return ""

# Run the app
if __name__ == '__main__':
    app.run(debug=True)  # app.run_server() is the older name for this call

In [23]:
import dash
from dash import Dash, dash_table, html, Input, Output, ctx
import pandas as pd

# Load your CSV into a DataFrame
csv_file_path = "profiles.csv"
df = pd.read_csv(csv_file_path)

# Add new, initially empty columns for reviewer annotations.
# Both are rendered as plain editable text cells ("Acceptance" is not a checkbox widget).
df['Acceptance'] = ''
df['Notes'] = ''

# Create a Dash app
app = Dash(__name__)

# Layout of the app
app.layout = html.Div([
    html.H1("Editable Profiles Table"),
    dash_table.DataTable(
        id='editable-table',
        columns=[{"name": col, "id": col, "editable": True} for col in df.columns],
        data=df.to_dict('records'),
        editable=True,
        row_deletable=True,
        style_table={'overflowX': 'auto'},
        style_cell={'textAlign': 'left', 'padding': '5px'},
    ),
    html.Button("Save Changes", id='save-button', n_clicks=0),
    html.Div(id='save-output', style={"margin-top": "20px"})
])

# Callback to save changes back to CSV
@app.callback(
    Output('save-output', 'children'),
    Input('save-button', 'n_clicks'),
    Input('editable-table', 'data')
)
def save_to_csv(n_clicks, rows):
    if "save-button" == ctx.triggered_id:  # Ensure save button was clicked
        edited_df = pd.DataFrame(rows)
        edited_df.to_csv(csv_file_path, index=False)
        return f"✅ Changes saved to '{csv_file_path}'"
    return ""

# Run the app
if __name__ == '__main__':
    app.run(debug=True)  # app.run_server() is the older name for this call
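
Once a reviewer has saved the table, the annotations can be read back for a quick summary. The sketch below assumes the saved CSV contains the `Acceptance` column added above and that reviewers type free-text values such as `yes` or `no` into it. Both the helper name `summarize_review` and that convention are assumptions, not part of the original notebook.

```python
import csv
import os
import tempfile
from collections import Counter

def summarize_review(csv_path):
    """Count rows per 'Acceptance' value; blank cells are counted as 'unreviewed'."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return Counter(
        (row.get("Acceptance") or "").strip().lower() or "unreviewed"
        for row in rows
    )

# Demo on a throwaway file with made-up rows
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "reviewed.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "Acceptance", "Notes"])
        writer.writeheader()
        writer.writerow({"file_name": "a.txt", "Acceptance": "Yes", "Notes": ""})
        writer.writerow({"file_name": "b.txt", "Acceptance": "", "Notes": "company page?"})
    summary = summarize_review(demo_path)

print(dict(summary))
```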