# 01 Data Collection — Enhanced Pipeline (v0.2)

## Purpose
Scrape LinkedIn job listings, visit each job page to extract the full description,
and produce a **structured, ML-ready dataset** with normalized fields.

## Pipeline Phases
1. **Phase 1 — Listing Scrape**: Fetch job cards (title, company, location, link)
2. **Phase 2 — Detail Extraction**: Visit each link → pull full job description
3. **Phase 3 — NLP Parsing**: Title normalization, skills extraction, seniority inference, etc.
4. **Phase 4 — Quality & Output**: Validate schema → save versioned Parquet + CSV
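
The skills-extraction step in Phase 3 can be illustrated with a minimal sketch. It assumes a flat vocabulary list and whole-token matching; `SKILL_VOCAB` and `extract_skills` are hypothetical names, not the project's `parse_job` implementation. Short entries such as `R` or `RAG` are easy to over-match with plain substring search (for example inside "leverage"), so the sketch uses lookarounds and keeps short entries case-sensitive.

```python
import re

# Hypothetical vocabulary for illustration only; the project's real
# vocabulary lives in src.preprocessing.vocabularies and may differ.
SKILL_VOCAB = ["Python", "SQL", "R", "RAG", "Rust", "A/B Testing"]


def _compile(skill):
    # Lookarounds instead of \b so entries containing symbols
    # ("A/B Testing") still match as whole tokens.
    # Entries of 3 characters or fewer stay case-sensitive to limit false hits.
    flags = 0 if len(skill) <= 3 else re.IGNORECASE
    return re.compile(rf"(?<!\w){re.escape(skill)}(?!\w)", flags)


PATTERNS = {skill: _compile(skill) for skill in SKILL_VOCAB}


def extract_skills(text):
    """Return the sorted vocabulary skills that occur as whole tokens in text."""
    return sorted(s for s, p in PATTERNS.items() if p.search(text))


demo = "We trust you to leverage Python and SQL; A/B testing experience is a plus."
print(extract_skills(demo))  # ['A/B Testing', 'Python', 'SQL']
```

Neither "trust" nor "leverage" triggers `Rust` or `RAG` here, which is the point of matching on token boundaries rather than substrings.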

## IO
- **Input**: Search parameters (keywords, location, limit)
- **Output**:
  - `data/processed/jobs_latest.parquet` — structured, ML-ready dataset
  - `data/raw/raw_listings_*.json` — raw listing data (for reprocessing)
  - `outputs/tables/scrape_report_*.json` — quality report
  - `outputs/logs/scrape_*.log` — full run log
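
Because the raw listings, reports, and logs are written with a wildcard suffix, a downstream step usually needs the newest file matching a pattern. The helper below is a small sketch under the assumption that modification time is a good proxy for "latest"; `latest_file` is a hypothetical name and is not part of `src.pipeline`.

```python
from pathlib import Path


def latest_file(directory, pattern):
    """Return the most recently modified file matching pattern, or None."""
    matches = sorted(Path(directory).glob(pattern), key=lambda p: p.stat().st_mtime)
    return matches[-1] if matches else None


# Paths are relative to the project root, as in the IO list above.
raw_path = latest_file("data/raw", "raw_listings_*.json")
print(raw_path)
```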

In [1]:
import sys, os

# Ensure the project root is on the path so `src.*` imports work
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

import pandas as pd
import json
import logging
from datetime import datetime

# Project modules
from src.scraping.linkedin_scraper import LinkedInScraper
from src.preprocessing.job_parser import parse_job
from src.preprocessing.vocabularies import SCHEMA_VERSION, DATASET_VERSION, SCHEMA_FIELDS
from src.pipeline import (
    setup_logging,
    validate_schema,
    save_dataset,
    save_raw_listings,
    save_scrape_report,
    run_pipeline,
)

# Ensure output directories exist (anchored at the project root, not the current directory)
for d in ["data/raw", "data/processed", "data/interim",
          "outputs/logs", "outputs/tables", "outputs/figures"]:
    os.makedirs(os.path.join(project_root, d), exist_ok=True)

print(f"Schema version: {SCHEMA_VERSION}")
print(f"Dataset version: {DATASET_VERSION}")
print(f"Expected fields: {len(SCHEMA_FIELDS)}")

Schema version: 0.2
Dataset version: v0.2
Expected fields: 25


## Configuration
Set search parameters here. Keep `limit` at 20–30 during development.

In [2]:
# ── Search parameters ──
KEYWORDS = "Data Scientist"
LOCATION = ""          # Empty = worldwide
LIMIT = 25             # Keep at 20-30 for dev; scale later

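# ── Output paths (absolute, so they stay valid after the os.chdir in the pipeline cell) ──
RAW_OUTPUT_DIR = os.path.join(project_root, "data", "raw")
PROCESSED_OUTPUT_DIR = os.path.join(project_root, "data", "processed")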

print(f"Search: '{KEYWORDS}' | Location: '{LOCATION or 'worldwide'}' | Limit: {LIMIT}")

Search: 'Data Scientist' | Location: 'worldwide' | Limit: 25


## Run Pipeline
Execute the full scraping + enrichment + parsing pipeline.

This will:
1. Scrape listing cards from LinkedIn (public guest API)
2. Visit each job URL and extract the full description
3. Parse each description into structured fields (title, skills, seniority, etc.)
4. Apply quality gates (discard records with no description or no skills)
5. Save versioned Parquet + CSV outputs
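
The quality gate in step 4 can be sketched in pandas as follows. This is an illustration only: `apply_quality_gates` is a hypothetical helper, the column names are taken from the schema printed later in this notebook, and the actual gates inside `run_pipeline` may be implemented differently.

```python
import pandas as pd


def apply_quality_gates(df):
    """Keep rows that have a non-blank description and at least one required skill."""
    has_description = df["job_description_raw"].fillna("").str.strip().ne("")
    has_skills = df["skills_required"].apply(lambda s: s is not None and len(s) > 0)
    return df[has_description & has_skills].reset_index(drop=True)


# Toy records: only "a" has both a description and a required skill.
toy = pd.DataFrame({
    "job_id": ["a", "b", "c"],
    "job_description_raw": ["Build models in Python.", "", None],
    "skills_required": [["Python"], ["SQL"], []],
})
kept = apply_quality_gates(toy)
print(kept["job_id"].tolist())  # ['a']
```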

In [3]:
# Set working directory to project root for pipeline paths
os.chdir(project_root)

# Run the full pipeline
df = run_pipeline(
    keywords=KEYWORDS,
    location=LOCATION,
    limit=LIMIT,
    output_dir=PROCESSED_OUTPUT_DIR,
    raw_output_dir=RAW_OUTPUT_DIR,
    save_csv=True,
)

if df is not None:
    print(f"\n{'='*60}")
    print(f"SUCCESS: {len(df)} structured job records produced")
    print(f"{'='*60}")
else:
    print("Pipeline returned no data. Check logs in outputs/logs/")

2026-02-10 20:53:10,288 | INFO    | src.pipeline | TOP APPLICANT — Data Collection Pipeline
2026-02-10 20:53:10,289 | INFO    | src.pipeline | Schema version: 0.2
2026-02-10 20:53:10,292 | INFO    | src.pipeline | Dataset version: v0.2
2026-02-10 20:53:10,293 | INFO    | src.pipeline | Search: 'Data Scientist' | Location: '' | Limit: 25
2026-02-10 20:53:10,296 | INFO    | src.scraping.linkedin_scraper | Phase 1: Scraping up to 25 listings for 'Data Scientist' in ''
2026-02-10 20:53:12,025 | INFO    | src.scraping.linkedin_scraper |   [1] Data Scientist Intern @ Tinder
2026-02-10 20:53:12,026 | INFO    | src.scraping.linkedin_scraper |   [2] Data Scientist I, Integrity @ Tinder
2026-02-10 20:53:12,027 | INFO    | src.scraping.linkedin_scraper |   [3] Data Scientist (L5) , Performance Marketing @ Netflix
2026-02-10 20:53:12,028 | INFO    | src.scraping.linkedin_scraper |   [4] Head of Artificial Intelligence and Data Science @ Addison Group
2026-02-10 20:53:12,029 | INFO    | src.scrapin


SUCCESS: 25 structured job records produced


## Inspect Results
Quickly validate that the dataset supports downstream analysis: schema columns, a sample record, value distributions, and skill frequencies.

In [4]:
if df is not None and not df.empty:
    print("─── Schema Columns ───")
    print(f"Columns ({len(df.columns)}): {list(df.columns)}\n")

    print("─── Sample Record ───")
    sample = df.iloc[0].to_dict()
    for k, v in sample.items():
        val_repr = repr(v) if isinstance(v, list) else str(v)
        if len(val_repr) > 120:
            val_repr = val_repr[:120] + "..."
        print(f"  {k:30s} = {val_repr}")

    print("\n─── Distribution Checks ───")
    print(f"\nSeniority levels:\n{df['seniority_level'].value_counts().to_string()}")
    print(f"\nRole types:\n{df['role_type'].value_counts().to_string()}")
    print(f"\nWork modes:\n{df['work_mode'].value_counts().to_string()}")
    print(f"\nEmployment types:\n{df['employment_type'].value_counts().to_string()}")

    # Skill frequency (top 15)
    from collections import Counter
    all_skills = [s for skills in df['skills_required'] for s in skills]
    skill_freq = Counter(all_skills).most_common(15)
    print("\nTop 15 required skills:")
    for skill, count in skill_freq:
        pct = count / len(df) * 100
        print(f"  {skill:30s} {count:3d} ({pct:5.1f}%)")

    # Verify the 4 final checks; the mark reflects the actual result
    print("\n─── Final Validation Checks ───")
    final_checks = {
        "Can compare Junior vs Senior": df['seniority_level'].nunique() > 1,
        "Can distinguish role types": df['role_type'].nunique() > 1,
        "Can compute skill frequency": len(all_skills) > 0,
        "Has raw descriptions": df['job_description_raw'].notna().all(),
    }
    for label, passed in final_checks.items():
        print(f"  {'✓' if passed else '✗'} {label}: {passed}")
else:
    print("No data available. Run the pipeline cell above first.")

─── Schema Columns ───
Columns (25): ['job_id', 'company', 'raw_title', 'normalized_title', 'title_keywords', 'role_type', 'seniority_level', 'skills_required', 'skills_optional', 'tools_frameworks', 'min_years_experience', 'experience_text', 'employment_type', 'work_mode', 'city', 'region', 'country', 'date_posted_raw', 'date_posted_normalized', 'job_description_raw', 'source', 'job_url', 'scrape_timestamp', 'schema_version', 'dataset_version']

─── Sample Record ───
  job_id                         = 2c6a4738b0788d7f
  company                        = Tinder
  raw_title                      = Data Scientist Intern
  normalized_title               = Data Scientist
  title_keywords                 = ['Intern']
  role_type                      = GenAI / LLM Engineer
  seniority_level                = Senior
  skills_required                = ['Large Language Models', 'Machine Learning', 'Python', 'SQL']
  skills_optional                = []
  tools_frameworks               = ['Python', 

## Skill Co-occurrence & Ideal Candidate Profile
Compute which skills appear together most often — the foundation for building an "ideal candidate profile".
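
For readers who want the full symmetric matrix rather than only the top pairs, the same pair counting can be turned into a DataFrame. The sketch below uses made-up toy skill lists (`toy_skills` is not pipeline output) so the result is easy to check by hand.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Toy data for illustration; each inner list stands for one job's required skills.
toy_skills = [
    ["Python", "SQL"],
    ["Python", "SQL", "Tableau"],
    ["Python", "Machine Learning"],
]

# Count each unordered skill pair once per job.
pair_counts = Counter()
for skills in toy_skills:
    pair_counts.update(combinations(sorted(set(skills)), 2))

# Fill a symmetric matrix; the diagonal stays 0 because self-pairs are not counted.
labels = sorted({s for skills in toy_skills for s in skills})
matrix = pd.DataFrame(0, index=labels, columns=labels)
for (a, b), n in pair_counts.items():
    matrix.loc[a, b] = n
    matrix.loc[b, a] = n

print(matrix)
```

In this toy set, `Python` and `SQL` co-occur in two jobs, so `matrix.loc["Python", "SQL"]` is 2.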

In [5]:
if df is not None and not df.empty:
    from itertools import combinations
    from collections import Counter

    # Skill co-occurrence matrix (top pairs)
    pair_counts = Counter()
    for skills in df['skills_required']:
        for pair in combinations(sorted(set(skills)), 2):
            pair_counts[pair] += 1

    print("─── Top 15 Skill Co-occurrences ───")
    for (s1, s2), count in pair_counts.most_common(15):
        print(f"  {s1:25s} + {s2:25s}  ({count} jobs)")

    # Ideal candidate profile: average skills per seniority
    print("\n─── Skills by Seniority Level ───")
    for level in df['seniority_level'].unique():
        subset = df[df['seniority_level'] == level]
        all_req = [s for skills in subset['skills_required'] for s in skills]
        top_5 = Counter(all_req).most_common(5)
        avg_skills = subset['skills_required'].apply(len).mean()
        print(f"\n  {level} (n={len(subset)}, avg skills={avg_skills:.1f}):")
        for skill, cnt in top_5:
            print(f"    {skill}: {cnt}")

    # Point downstream notebooks at the processed dataset
    print("\n─── Output Files ───")
    print("  Parquet: data/processed/jobs_latest.parquet")
    print("  Use: pd.read_parquet('data/processed/jobs_latest.parquet')")
else:
    print("No data available.")

─── Top 15 Skill Co-occurrences ───
  Python                    + SQL                        (14 jobs)
  RAG                       + SQL                        (11 jobs)
  Machine Learning          + Python                     (10 jobs)
  Machine Learning          + SQL                        (9 jobs)
  RAG                       + Rust                       (8 jobs)
  Rust                      + SQL                        (7 jobs)
  R                         + SQL                        (7 jobs)
  Python                    + R                          (6 jobs)
  Python                    + RAG                        (6 jobs)
  A/B Testing               + RAG                        (5 jobs)
  A/B Testing               + SQL                        (5 jobs)
  Apache Spark              + SQL                        (5 jobs)
  RAG                       + Tableau                    (5 jobs)
  SQL                       + Tableau                    (5 jobs)
  Python                    + Scala  