# Data Science Jobs - API Scraping Notebook

## Project: Data Science Job Market Analysis
**Author:** Mayenmein Terence Sama Aloah Jr<br>
**Date:** 09/23/2025  
**Description:** This notebook handles the API-based scraping of data science job postings from the Found.dev API.

In [12]:
import sys
import os
from pathlib import Path
import pandas as pd
import numpy as np
import requests
import time
from datetime import datetime
import json
# Add the parent directory (project root) to the import path
sys.path.insert(0, '..')
print("Libraries imported successfully!")

Libraries imported successfully!


## 1. Project Setup and Configuration
Configure paths and import the scraping module.

In [13]:
# Configuration
DATA_RAW_PATH = Path('../data/raw')

# Create directories if they don't exist
DATA_RAW_PATH.mkdir(parents=True, exist_ok=True)

print(f"📁 Data will be saved to: {DATA_RAW_PATH.absolute()}")

📁 Data will be saved to: c:\Users\MARIE\Desktop\scrape job details\notebooks\..\data\raw


In [14]:
# Import the scraping function
try:
    from scr.scraping.scrape_jobs import scrape_in_batches, fetch_jobs
    print("✅ Custom scraping modules imported successfully!")
except ImportError as e:
    print(f"❌ Error importing custom modules: {e}")    

✅ Custom scraping modules imported successfully!


## 2. API Connection Test
Test the API connection with a single page request.

In [15]:
# Test API connection
print("🧪 Testing API connection...")

try:
    test_data = fetch_jobs(page=1, skill="Data Science", ai=True)
    jobs_count = len(test_data.get("jobs", []))
    print(f"✅ API connection successful! Found {jobs_count} jobs on page 1")
    
    # Display sample job structure
    if jobs_count > 0:
        sample_job = test_data["jobs"][0]
        print("\n📋 Sample job structure:")
        print(json.dumps(sample_job, indent=2)[:500] + "...")
        
except Exception as e:
    print(f"❌ API test failed: {e}")

🧪 Testing API connection...
✅ API connection successful! Found 100 jobs on page 1

📋 Sample job structure:
{
  "job": {
    "created_at": "0001-01-01T00:00:00Z",
    "premium": false,
    "title": "AI Testing Lead",
    "roles": "",
    "skills": "UI,Data Science,Conversational AI,Product Management,Product Design,UX",
    "soft_skills": "",
    "tools": "",
    "languages": "",
    "frameworks": "",
    "libraries": "",
    "seniority": "",
    "type": "Full Time",
    "city": "San Bruno, CA",
    "country": "USA",
    "published": "2025-09-23T15:43:53.924045Z",
    "pin_until": "0001-01-01T00:00:00...
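
The sample record above nests its fields under a `"job"` key and stores multi-valued fields such as `skills` as one comma-separated string. A minimal sketch of unpacking such a record is shown here; `split_csv_field` is a hypothetical helper (not part of the scraping module), and the record is an abbreviated copy of the printed sample.

```python
def split_csv_field(value):
    """Split a comma-separated API field into a list of clean strings.

    Hypothetical helper for illustration; empty or missing values give [].
    """
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


# Abbreviated copy of the sample record printed above
sample_record = {
    "job": {
        "title": "AI Testing Lead",
        "skills": "UI,Data Science,Conversational AI,Product Management,Product Design,UX",
        "tools": "",
        "type": "Full Time",
    }
}

# Fields live one level down, under the "job" key
job = sample_record.get("job", {})
skills = split_csv_field(job.get("skills"))
tools = split_csv_field(job.get("tools"))

print(skills)
print(tools)
```

Empty strings such as `"tools": ""` come back as empty lists, which may be easier to count and filter on than blank strings.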


## 3. Scraping Parameters Configuration
Configure the scraping parameters for the batch process.

In [17]:
# Scraping configuration
SCRAPING_CONFIG = {
    "skill": "Data Science",
    "pages_per_batch": 20,
    "ai": True,
    "delay": 1,
    "start_page": 41,
    "start_batch": 1
}

print("⚙️  Scraping Configuration:")
for key, value in SCRAPING_CONFIG.items():
    print(f"   {key}: {value}")

⚙️  Scraping Configuration:
   skill: Data Science
   pages_per_batch: 20
   ai: True
   delay: 1
   start_page: 41
   start_batch: 1
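
The relationship between `start_page`, `start_batch`, and `pages_per_batch` can be checked with a little arithmetic. The sketch below assumes batches are numbered from 1 and cover consecutive, equally sized page ranges, which matches the log line "Starting batch 3 (pages 41 → 60)" printed by the scraper. Both function names are hypothetical and are not taken from the scraping module.

```python
def batch_page_range(batch_number, pages_per_batch):
    """Return (first_page, last_page) for a 1-based batch number."""
    if batch_number < 1 or pages_per_batch < 1:
        raise ValueError("batch_number and pages_per_batch must be >= 1")
    first_page = (batch_number - 1) * pages_per_batch + 1
    last_page = batch_number * pages_per_batch
    return first_page, last_page


def batch_for_page(page, pages_per_batch):
    """Return the 1-based batch number that contains the given page."""
    if page < 1 or pages_per_batch < 1:
        raise ValueError("page and pages_per_batch must be >= 1")
    return (page - 1) // pages_per_batch + 1


print(batch_page_range(3, 20))
print(batch_for_page(41, 20))
```

Under these assumptions, page 41 with 20 pages per batch falls in batch 3, so a configuration with `start_page: 41` would pair with batch 3 rather than `start_batch: 1`.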


## 4. Execute Batch Scraping
Run the main scraping process to collect job data in batches.

In [18]:
# Start the batch scraping process
print("🚀 Starting batch scraping process...")
print("⏳ This may take several minutes depending on the number of pages...")

start_time = datetime.now()
print(f"🕐 Started at: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")

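try:
    # Note: "start_page" and "start_batch" from SCRAPING_CONFIG are not passed here,
    # so scrape_in_batches determines the starting point itself.
    total_jobs_collected = scrape_in_batches(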
        skill=SCRAPING_CONFIG["skill"],
        pages_per_batch=SCRAPING_CONFIG["pages_per_batch"],
        ai=SCRAPING_CONFIG["ai"],
        delay=SCRAPING_CONFIG["delay"]
    )
    
    end_time = datetime.now()
    duration = end_time - start_time
    
    print(f"\n🎉 Scraping completed successfully!")
    print(f"📊 Total jobs collected: {total_jobs_collected}")
    print(f"⏱️  Duration: {duration}")
    print(f"🕐 Finished at: {end_time.strftime('%Y-%m-%d %H:%M:%S')}")
    
except Exception as e:
    print(f"❌ Scraping failed: {e}")

🚀 Starting batch scraping process...
⏳ This may take several minutes depending on the number of pages...
🕐 Started at: 2025-09-23 17:37:29

🚀 Starting batch 3 (pages 41 → 60)
Fetching page 41...
Fetching page 42...


KeyboardInterrupt: 
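
The run above stopped with a `KeyboardInterrupt`. In Python, `KeyboardInterrupt` derives from `BaseException` rather than `Exception`, so the `except Exception` clause in the cell does not catch it, and `end_time` and `duration` are never set. A minimal sketch of a loop that keeps partial results on interruption follows; `run_interruptibly` and `fake_fetch` are hypothetical stand-ins, not the actual `scrape_in_batches` implementation.

```python
def run_interruptibly(fetch_page, pages):
    """Fetch pages in order; on Ctrl+C, stop and keep what was collected.

    fetch_page is any callable that takes a page number and returns a list.
    Returns (collected_items, interrupted_flag).
    """
    collected = []
    interrupted = False
    try:
        for page in pages:
            collected.extend(fetch_page(page))
    except KeyboardInterrupt:
        # KeyboardInterrupt is not a subclass of Exception, so it needs its own clause
        interrupted = True
    return collected, interrupted


def fake_fetch(page):
    """Stand-in for a real API call; simulates Ctrl+C while fetching page 43."""
    if page == 43:
        raise KeyboardInterrupt
    return [f"job-{page}"]


jobs, was_interrupted = run_interruptibly(fake_fetch, range(41, 61))
print(jobs, was_interrupted)
```

Returning the partial list together with a flag lets the caller decide whether to save what was fetched or to discard it and rerun the batch.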

## 5. Data Quality Check
Verify the collected data and check for any issues.

In [8]:
# Check what batch files were created (sorted so that batch_files[0] is the first batch)
batch_files = sorted(DATA_RAW_PATH.glob("jobs_batch_*.csv"))
print(f"📁 Found {len(batch_files)} batch files:")

for batch_file in sorted(batch_files):
    df = pd.read_csv(batch_file)
    print(f"   {batch_file.name}: {len(df)} jobs")

# Load and examine the first batch
if batch_files:
    first_batch = pd.read_csv(batch_files[0])
    print(f"\n📊 Sample from first batch ({len(first_batch)} jobs):")
    print(f"   Columns: {list(first_batch.columns)}")
    print(f"   Date range: {first_batch['published'].min()} to {first_batch['published'].max()}")
    
    # Display first few rows
    print("\n📋 First 3 job titles:")
    for i, title in enumerate(first_batch['title'].head(3)):
        print(f"   {i+1}. {title}")

📁 Found 6 batch files:
   jobs_batch_1.csv: 2000 jobs
   jobs_batch_2.csv: 1999 jobs
   jobs_batch_3.csv: 2000 jobs
   jobs_batch_4.csv: 2000 jobs
   jobs_batch_5.csv: 2000 jobs
   jobs_batch_6.csv: 1625 jobs

📊 Sample from first batch (2000 jobs):
   Columns: ['title', 'company', 'city', 'country', 'location', 'skills', 'type', 'salary', 'salary_min', 'salary_max', 'published', 'ai']
   Date range: 2025-09-03T15:31:25.707176Z to 2025-09-23T12:23:09.98934Z

📋 First 3 job titles:
   1. Associate Research Director - Artificial Intelligence & Machine Learning
   2. ML Engineer
   3. Applied Scientist - Deep Learning
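
Sorting file names as plain strings works for the six batches listed here, but it would place `jobs_batch_10.csv` before `jobs_batch_2.csv` once there are ten or more batches. A small sketch of sorting by the numeric suffix instead is given below; `batch_number` is a hypothetical helper, and the file names in the demo are illustrative rather than real files.

```python
import re
from pathlib import Path


def batch_number(path):
    """Extract the integer N from a name like jobs_batch_N.csv.

    Names that do not match the pattern sort last.
    """
    match = re.search(r"jobs_batch_(\d+)\.csv$", Path(path).name)
    return int(match.group(1)) if match else float("inf")


# Illustrative file names only; nothing is read from disk
names = [f"jobs_batch_{n}.csv" for n in (10, 2, 1, 11)]

lexicographic = sorted(names)
numeric = sorted(names, key=batch_number)

print(lexicographic)
print(numeric)
```

With real data, the same key could be passed as `sorted(DATA_RAW_PATH.glob("jobs_batch_*.csv"), key=batch_number)`.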


## 6. Data Summary Statistics
Generate basic statistics about the collected data.

In [9]:
# Combine all batches for summary statistics
if batch_files:
    all_data = pd.concat([pd.read_csv(f) for f in batch_files], ignore_index=True)
    
    print("📈 Data Summary Statistics:")
    print(f"Total jobs collected: {len(all_data)}")
    print(f"Unique companies: {all_data['company'].nunique()}")
    print(f"Unique locations: {all_data['location'].nunique()}")
    print(f"Date range: {all_data['published'].min()} to {all_data['published'].max()}")
    
    # Job type distribution
    print("\n📊 Job Type Distribution:")
    job_type_counts = all_data['type'].value_counts()
    for job_type, count in job_type_counts.items():
        percentage = (count / len(all_data)) * 100
        print(f"   {job_type}: {count} jobs ({percentage:.1f}%)")
    
    # AI-related jobs
    ai_jobs = all_data['ai'].sum() if 'ai' in all_data.columns else 0
    print(f"\n🤖 AI-related jobs: {ai_jobs} ({ai_jobs / len(all_data) * 100:.1f}%)")

else:
    print("❌ No batch files found for analysis")

📈 Data Summary Statistics:
Total jobs collected: 11624
Unique companies: 1533
Unique locations: 912
Date range: 2023-10-23T14:23:55.119Z to 2025-09-23T12:23:09.98934Z

📊 Job Type Distribution:
   Full Time: 7777 jobs (66.9%)
   Remote,Full Time: 735 jobs (6.3%)
   Remote: 677 jobs (5.8%)
   Temporary: 644 jobs (5.5%)
   Intern: 281 jobs (2.4%)
   Freelancer: 174 jobs (1.5%)
   Vollzeit: 129 jobs (1.1%)
   Stage: 116 jobs (1.0%)
   Voltijds: 100 jobs (0.9%)
   Onsite: 86 jobs (0.7%)
   Remote,Full Time,Freelancer: 59 jobs (0.5%)
   Remote,Freelancer: 57 jobs (0.5%)
   Part Time: 56 jobs (0.5%)
   Full Time,Freelancer: 51 jobs (0.4%)
   Permanent: 47 jobs (0.4%)
   Internship: 46 jobs (0.4%)
   Remote,Internship: 39 jobs (0.3%)
   Full Time,Onsite: 38 jobs (0.3%)
   Full Time,Internship: 35 jobs (0.3%)
   Part Time,Full Time: 35 jobs (0.3%)
   Remote,Onsite: 28 jobs (0.2%)
   Hybrid: 26 jobs (0.2%)
   Employee: 25 jobs (0.2%)
   Heltid: 17 jobs (0.1%)
   A jornada completa: 15 jobs (0.1%
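
The distribution above mixes combined labels ("Remote,Full Time") with non-English labels for the same concept ("Vollzeit", "Voltijds", "Heltid", "A jornada completa"). One possible way to normalize them during cleaning is sketched below. The translation table and the `normalize_job_types` function are assumptions for illustration, not part of the existing pipeline, and the mapping of "Stage" and "Intern" to "Internship" is a judgment call that should be checked against the data.

```python
# Assumed mapping of non-English or variant labels to one canonical label
TYPE_TRANSLATIONS = {
    "vollzeit": "Full Time",            # German
    "voltijds": "Full Time",            # Dutch
    "heltid": "Full Time",              # Swedish / Norwegian / Danish
    "a jornada completa": "Full Time",  # Spanish
    "stage": "Internship",              # French / Dutch / Italian usage (assumed)
    "intern": "Internship",             # variant spelling (assumed)
}


def normalize_job_types(raw):
    """Split a comma-separated type string into canonical labels.

    Missing values (None, NaN, empty strings) give an empty list.
    Labels without a translation are kept unchanged; order is preserved.
    """
    if not isinstance(raw, str) or not raw.strip():
        return []
    labels = []
    for part in raw.split(","):
        label = part.strip()
        if not label:
            continue
        label = TYPE_TRANSLATIONS.get(label.lower(), label)
        if label not in labels:
            labels.append(label)
    return labels


print(normalize_job_types("Remote,Full Time"))
print(normalize_job_types("Vollzeit"))
print(normalize_job_types(float("nan")))
```

Keeping the result as a list makes it possible to count "Remote" and "Full Time" separately instead of treating each combination as its own category.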

## 7. Data Validation
Check for data quality issues and missing values.

In [10]:
# Data quality check
if batch_files:
    print("🔍 Data Quality Check:")
    
    # Check for missing values
    missing_data = all_data.isnull().sum()
    print("\n❌ Missing values per column:")
    for column, missing_count in missing_data.items():
        if missing_count > 0:
            percentage = (missing_count / len(all_data)) * 100
            print(f"   {column}: {missing_count} missing ({percentage:.1f}%)")
    
    # Check for duplicates
    duplicates = all_data.duplicated().sum()
    print(f"\n🔄 Duplicate entries: {duplicates}")
    
    # Check data types
    print(f"\n📝 Data types:")
    print(all_data.dtypes)

else:
    print("❌ No data available for validation")

🔍 Data Quality Check:

❌ Missing values per column:
   city: 2597 missing (22.3%)
   country: 569 missing (4.9%)
   type: 50 missing (0.4%)
   salary: 7956 missing (68.4%)

🔄 Duplicate entries: 28

📝 Data types:
title         object
company       object
city          object
country       object
location      object
skills        object
type          object
salary        object
salary_min     int64
salary_max     int64
published     object
ai              bool
dtype: object
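
The check above counts fully identical rows with `DataFrame.duplicated()`. A minimal sketch of removing such rows is shown here on a toy frame; the company name and timestamps in the toy data are made up for illustration and do not come from the scraped files.

```python
import pandas as pd

# Toy data with one exact duplicate row (values are illustrative only)
toy = pd.DataFrame(
    {
        "title": ["ML Engineer", "ML Engineer", "Applied Scientist - Deep Learning"],
        "company": ["Example Co", "Example Co", "Example Co"],
        "published": [
            "2025-09-23T12:00:00Z",
            "2025-09-23T12:00:00Z",
            "2025-09-22T08:30:00Z",
        ],
    }
)

# duplicated() marks every repeat of an earlier identical row
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduplicated = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates)
print(len(deduplicated))
```

Whether rows that differ only in one column (for example, a re-published posting) should also count as duplicates is a cleaning decision; `drop_duplicates` accepts a `subset` argument for that case.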


## 8. Save Metadata and Logs
Record scraping session information for reproducibility.

In [20]:
# Save scraping metadata
metadata = {
    "scraping_session": {
        "start_time": start_time.isoformat(),
        "end_time": end_time.isoformat() if 'end_time' in locals() else datetime.now().isoformat(),
        "duration_seconds": duration.total_seconds() if 'duration' in locals() else 0,
        "total_jobs_collected": total_jobs_collected if 'total_jobs_collected' in locals() else 0,
        "batches_created": len(batch_files),
        "configuration": SCRAPING_CONFIG
    }
}

# Save metadata to file
metadata_path = DATA_RAW_PATH / "scraping_metadata.json"
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"✅ Metadata saved to: {metadata_path}")

✅ Metadata saved to: ..\data\raw\scraping_metadata.json


## 9. Next Steps
Prepare for the next phase of the pipeline.

In [None]:
print("🎯 Next Steps:")
print("1. ✅ Data scraping completed")
print("2. ➡️  Proceed to 02_cleaning.ipynb for data cleaning")
print("3. 🔧 Clean the raw data and handle missing values")
print("4. 💾 Save cleaned data to data/interim/ directory")

# Display file sizes for reference
if batch_files:
    print(f"\n📊 File sizes:")
    for batch_file in sorted(batch_files):
        size_mb = os.path.getsize(batch_file) / (1024 * 1024)
        print(f"   {batch_file.name}: {size_mb:.2f} MB")

## Summary
- **API Source**: Found.dev jobs API
- **Data Collected**: Data Science job postings
- **Output**: Multiple CSV batches in `data/raw/` directory
- **Next Phase**: Data cleaning and preprocessing
- **Testing**: Unit tests are maintained separately in `tests/test_scraping.py`

The scraping process is complete and the data is ready for the next stage of the pipeline.