In [None]:
import pandas as pd

# Load the cleaned O*NET task data (output_data_path is defined earlier in the notebook)
ONET = pd.read_csv(f'{output_data_path}/ONET_cleaned_tasks.csv')

# Get all unique occupation titles from the dataset
occupations_list = sorted(ONET['Occupation Title'].unique().tolist())
print(f"Found {len(occupations_list)} unique occupations in the dataset:")

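# Optional: seed the RNG so the sample below is reproducible (uncomment to use; needs `import random`)
# random.seed(42)
# np.random.seed(42)  # only matters if NumPy randomness is used elsewhere; needs `import numpy as np`

# Optional: randomly sample 10% of occupations for quick testing (uncomment to use)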
# sample_size = max(1, int(len(occupations_list) * 0.10))  # Ensure at least 1 occupation
# sampled_occupations = random.sample(occupations_list, sample_size)
# print(f"Randomly selected {len(sampled_occupations)} occupations (10% of total) for processing:")
# print(f"Sample: {sampled_occupations[:5]}..." if len(sampled_occupations) > 5 else f"Sample: {sampled_occupations}")

# Counters tracking how each occupation was handled (default run: all occupations)
processed_count = 0
skipped_count = 0
error_count = 0

# Use the full occupations list by default. To run on a smaller sample,
# uncomment the sampling lines above and assign sampled_occupations here instead:
# occupations_to_process = sampled_occupations
occupations_to_process = occupations_list

for i, occupation in enumerate(occupations_to_process, 1):
    # Filter data for this occupation
    occupation_data = ONET[ONET['Occupation Title'] == occupation].copy()
    
    # Prepare task data
    occupation_task_data = occupation_data[['Task ID', 'Task Title', 'O*NET-SOC Code']].drop_duplicates().reset_index(drop=True)
    
    # Print a progress header; the per-occupation details (task count, JSON previews)
    # appear to be logged inside extract_task_sequence
    print(f"\n[{i}/{len(occupations_to_process)}] {occupation}")
    
    # Extract the task sequence; returns (output_file, already_existed)
    output_file, already_existed = extract_task_sequence(occupation, occupation_task_data, output_data_path)
    
    if output_file is None:
        # No output file means the extraction failed for this occupation
        error_count += 1
    elif already_existed:
        print("   ⏭️  Already exists - skipping")
        skipped_count += 1
    else:
        processed_count += 1
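
# Optional sketch, not part of the original run: the three counters above are
# never reported, so a small helper like this could summarize them once the
# loop finishes. `format_run_summary` is a hypothetical name introduced here
# for illustration only.
def format_run_summary(processed, skipped, errors):
    """Return a one-line summary of how the occupations were handled."""
    total = processed + skipped + errors
    return (f"Done. Processed: {processed}, skipped: {skipped}, "
            f"errors: {errors}, total: {total}")

# Possible usage after the loop (assumes the counters defined in this cell):
# print(format_run_summary(processed_count, skipped_count, error_count))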

Found 873 unique occupations in the dataset:

[1/873] Accountants and Auditors
   • 26 tasks, using 32000 max tokens


   • Raw JSON length: 3986
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Examine wheth...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Examine wheth...
   ✅ Successfully processed and saved task sequence

[2/873] Actors
   • 19 tasks, using 32000 max tokens


   • Raw JSON length: 2746
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Write origina...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Write origina...
   ✅ Successfully processed and saved task sequence

[3/873] Actuaries
   • 15 tasks, using 32000 max tokens


   • Raw JSON length: 2341
   • Raw JSON preview: [
  {"Task Position": 1, "Task Title": "Analyze st...
   • Cleaned JSON preview: [
  {"Task Position": 1, "Task Title": "Analyze st...
   ✅ Successfully processed and saved task sequence

[4/873] Acupuncturists
   • 18 tasks, using 32000 max tokens


   • Raw JSON length: 2714
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Adhere to loc...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Adhere to loc...
   ✅ Successfully processed and saved task sequence

[5/873] Acute Care Nurses
   • 27 tasks, using 32000 max tokens


   • Raw JSON length: 3972
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Assess urgent...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Assess urgent...
   ✅ Successfully processed and saved task sequence

[6/873] Adapted Physical Education Specialists
   • 20 tasks, using 32000 max tokens


   • Raw JSON length: 2961
   • Raw JSON preview: [{"Task Position":1,"Task Title":"Assist in screen...
   • Cleaned JSON preview: [{"Task Position":1,"Task Title":"Assist in screen...
   ✅ Successfully processed and saved task sequence

[7/873] Adhesive Bonding Machine Operators and Tenders
   ⏭️  Already exists - skipping

[8/873] Administrative Law Judges, Adjudicators, and Hearing Officers
   • 14 tasks, using 32000 max tokens


   • Raw JSON length: 2014
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Schedule hear...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Schedule hear...
   ✅ Successfully processed and saved task sequence

[9/873] Administrative Services Managers
   • 8 tasks, using 32000 max tokens


   • Raw JSON length: 920
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Set goals and...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Set goals and...
   ✅ Successfully processed and saved task sequence

[10/873] Adult Basic Education, Adult Secondary Education, and English as a Second Language Instructors
   • 39 tasks, using 32000 max tokens


   • Raw JSON length: 5484
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Confer with l...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Confer with l...
   ✅ Successfully processed and saved task sequence

[11/873] Advanced Practice Psychiatric Nurses
   • 24 tasks, using 32000 max tokens


   • Raw JSON length: 3299
   • Raw JSON preview: [
  {"Task Position": 1, "Task Title": "Document p...
   • Cleaned JSON preview: [
  {"Task Position": 1, "Task Title": "Document p...
   ✅ Successfully processed and saved task sequence

[12/873] Advertising Sales Agents
   • 20 tasks, using 32000 max tokens


   • Raw JSON length: 2650
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Attend sales ...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Attend sales ...
   ✅ Successfully processed and saved task sequence

[13/873] Advertising and Promotions Managers
   • 21 tasks, using 32000 max tokens


   • Raw JSON length: 3074
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Read trade jo...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Read trade jo...
   ✅ Successfully processed and saved task sequence

[14/873] Aerospace Engineering and Operations Technologists and Technicians
   • 12 tasks, using 32000 max tokens


   • Raw JSON length: 1666
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Identify requ...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Identify requ...
   ✅ Successfully processed and saved task sequence

[15/873] Aerospace Engineers
   • 16 tasks, using 32000 max tokens


   • Raw JSON length: 2717
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Analyze proje...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Analyze proje...
   ✅ Successfully processed and saved task sequence

[16/873] Agents and Business Managers of Artists, Performers, and Athletes
   • 14 tasks, using 32000 max tokens


   • Raw JSON length: 1850
   • Raw JSON preview: [{"Task Position":1,"Task Title":"Conduct audition...
   • Cleaned JSON preview: [{"Task Position":1,"Task Title":"Conduct audition...
   ✅ Successfully processed and saved task sequence

[17/873] Agricultural Engineers
   • 14 tasks, using 32000 max tokens


   • Raw JSON length: 2105
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Meet with cli...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Meet with cli...
   ✅ Successfully processed and saved task sequence

[18/873] Agricultural Equipment Operators
   • 17 tasks, using 32000 max tokens


   • Raw JSON length: 2422
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Observe and l...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Observe and l...
   ✅ Successfully processed and saved task sequence

[19/873] Agricultural Inspectors
   • 22 tasks, using 32000 max tokens


   • Raw JSON length: 3194
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Interpret and...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Interpret and...
   ✅ Successfully processed and saved task sequence

[20/873] Agricultural Sciences Teachers, Postsecondary
   • 23 tasks, using 32000 max tokens


   • Raw JSON length: 2823
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Keep abreast ...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Keep abreast ...
   ✅ Successfully processed and saved task sequence

[21/873] Agricultural Technicians
   • 26 tasks, using 32000 max tokens


   • Raw JSON length: 3722
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Perform tests...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Perform tests...
   ✅ Successfully processed and saved task sequence

[22/873] Air Traffic Controllers
   • 23 tasks, using 32000 max tokens


   • Raw JSON length: 3252
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Inspect, adju...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Inspect, adju...
   ✅ Successfully processed and saved task sequence

[23/873] Aircraft Cargo Handling Supervisors
   • 6 tasks, using 32000 max tokens


   • Raw JSON length: 725
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Determine the...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Determine the...
   ✅ Successfully processed and saved task sequence

[24/873] Aircraft Mechanics and Service Technicians
   • 38 tasks, using 32000 max tokens


   • Raw JSON length: 5527
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Read and inte...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Read and inte...
   ✅ Successfully processed and saved task sequence

[25/873] Aircraft Structure, Surfaces, Rigging, and Systems Assemblers
   • 30 tasks, using 32000 max tokens


   • Raw JSON length: 4650
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Read blueprin...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Read blueprin...
   ✅ Successfully processed and saved task sequence

[26/873] Airfield Operations Specialists
   ⏭️  Already exists - skipping

[27/873] Airline Pilots, Copilots, and Flight Engineers
   • 26 tasks, using 32000 max tokens


   • Raw JSON length: 3456
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Confer with f...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Confer with f...
   ✅ Successfully processed and saved task sequence

[28/873] Allergists and Immunologists
   ⏭️  Already exists - skipping

[29/873] Ambulance Drivers and Attendants, Except Emergency Medical Technicians
   • 11 tasks, using 32000 max tokens


   • Raw JSON length: 1234
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Earn and main...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Earn and main...
   ✅ Successfully processed and saved task sequence

[30/873] Amusement and Recreation Attendants
   • 19 tasks, using 32000 max tokens


   • Raw JSON length: 2575
   • Raw JSON preview: [{"Task Position":1,"Task Title":"Keep informed of...
   • Cleaned JSON preview: [{"Task Position":1,"Task Title":"Keep informed of...
   ✅ Successfully processed and saved task sequence

[31/873] Anesthesiologist Assistants
   ⏭️  Already exists - skipping

[32/873] Anesthesiologists
   • 18 tasks, using 32000 max tokens


   • Raw JSON length: 2671
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Examine patie...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Examine patie...
   ✅ Successfully processed and saved task sequence

[33/873] Animal Breeders
   ⏭️  Already exists - skipping

[34/873] Animal Caretakers
   • 22 tasks, using 32000 max tokens


   • Raw JSON length: 2642
   • Raw JSON preview: [{"Task Position":1,"Task Title":"Adjust controls ...
   • Cleaned JSON preview: [{"Task Position":1,"Task Title":"Adjust controls ...
   ✅ Successfully processed and saved task sequence

[35/873] Animal Control Workers
   • 16 tasks, using 32000 max tokens


   • Raw JSON length: 2103
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Answer inquir...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Answer inquir...
   ✅ Successfully processed and saved task sequence

[36/873] Animal Scientists
   • 9 tasks, using 32000 max tokens


   • Raw JSON length: 1389
   • Raw JSON preview: [
  {"Task Position": 1, "Task Title": "Conduct re...
   • Cleaned JSON preview: [
  {"Task Position": 1, "Task Title": "Conduct re...
   ✅ Successfully processed and saved task sequence

[37/873] Animal Trainers
   • 19 tasks, using 32000 max tokens


   • Raw JSON length: 2308
   • Raw JSON preview: [{"Task Position":1,"Task Title":"Advise animal ow...
   • Cleaned JSON preview: [{"Task Position":1,"Task Title":"Advise animal ow...
   ✅ Successfully processed and saved task sequence

[38/873] Anthropologists and Archeologists
   • 28 tasks, using 32000 max tokens


   • Raw JSON length: 4960
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Plan and dire...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Plan and dire...
   ✅ Successfully processed and saved task sequence

[39/873] Anthropology and Archeology Teachers, Postsecondary
   • 26 tasks, using 32000 max tokens


   • Raw JSON length: 3147
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Keep abreast ...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Keep abreast ...
   ✅ Successfully processed and saved task sequence

[40/873] Appraisers and Assessors of Real Estate
   • 29 tasks, using 32000 max tokens


   • Raw JSON length: 4315
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Establish uni...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Establish uni...
   ✅ Successfully processed and saved task sequence

[41/873] Arbitrators, Mediators, and Conciliators
   • 20 tasks, using 32000 max tokens


   • Raw JSON length: 2686
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Organize or d...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Organize or d...
   ✅ Successfully processed and saved task sequence

[42/873] Architects, Except Landscape and Naval
   • 24 tasks, using 32000 max tokens


   • Raw JSON length: 3399
   • Raw JSON preview: [
  {"Task Position": 1, "Task Title": "Develop ma...
   • Cleaned JSON preview: [
  {"Task Position": 1, "Task Title": "Develop ma...
   ✅ Successfully processed and saved task sequence

[43/873] Architectural and Civil Drafters
   • 25 tasks, using 32000 max tokens


   • Raw JSON length: 4214
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Obtain and as...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Obtain and as...
   ✅ Successfully processed and saved task sequence

[44/873] Architectural and Engineering Managers
   • 21 tasks, using 32000 max tokens


   • Raw JSON length: 2751
   • Raw JSON preview: [{"Task Position": 1, "Task Title": "Consult or ne...
   • Cleaned JSON preview: [{"Task Position": 1, "Task Title": "Consult or ne...
   ✅ Successfully processed and saved task sequence

[45/873] Architecture Teachers, Postsecondary
   ⏭️  Already exists - skipping

[46/873] Archivists
   • 13 tasks, using 32000 max tokens


   • Raw JSON length: 1926
   • Raw JSON preview: [
  {"Task Position": 1, "Task Title": "Specialize...
   • Cleaned JSON preview: [
  {"Task Position": 1, "Task Title": "Specialize...
   ✅ Successfully processed and saved task sequence

[47/873] Area, Ethnic, and Cultural Studies Teachers, Postsecondary
   ⏭️  Already exists - skipping

[48/873] Art Directors
   • 16 tasks, using 32000 max tokens


   • Raw JSON length: 2301
   • Raw JSON preview: [
  {"Task Position": 1, "Task Title": "Research c...
   • Cleaned JSON preview: [
  {"Task Position": 1, "Task Title": "Research c...
   ✅ Successfully processed and saved task sequence

[49/873] Art Therapists
   • 25 tasks, using 32000 max tokens
