# Mean Shift Segmentation Batch Runner

Author: [Jerry Clayton](https://github.com/jerry-clayton)

For: [Ni-Meister Lab](http://www.geography.hunter.cuny.edu/~wenge/)

Adapted from [Ian Grant's Script](https://github.com/i-c-grant/ni-meister-gedi-biomass/blob/main/run_on_maap.py)

#### This Jupyter notebook handles batch processing, on the NASA MAAP platform, of the modified Mean Shift tree segmentation workflow developed by the author and Dr. Ni-Meister.

#### The [AMS3D](https://www.sciencedirect.com/science/article/abs/pii/S0034425716302292) algorithm was first proposed by Ferraz et al., and this implementation depends on Dr. Nikolai Knapp's [MeanShiftR](https://github.com/niknap/MeanShiftR/tree/master/R) package.

#### This workflow is executed in four parts: Split, Segment, Reconcile, and Merge.

In [1]:
import datetime
import logging
import os
import shutil
import time
import glob
import tarfile
import pickle

import warnings
from pathlib import Path
from typing import Dict, List

import click
import geopandas as gpd
import pandas as pd
from tqdm import tqdm
from geopandas import GeoDataFrame
from maap.maap import MAAP
from maap.Result import Granule

maap = MAAP(maap_host='api.maap-project.org')

In [2]:
def build_file_url(filename):
    url_first_part = "s3://maap-ops-workspace/jclayton0/sq_km_tiles_norm/" 
    url = f'{url_first_part}{filename}'
    return url

def build_test_file_url(filename):
    url_first_part = "s3://maap-ops-workspace/jclayton0/test-input-sm-tiles/" 
    url = f'{url_first_part}{filename}'
    return url

def get_split_kwargs(fileurl):
    job_kwargs = {
            "identifier": "Mean Shift Split LAS",
            "algo_id": "MS-Step-1-Split",
            "version": "main",
            "username": "jclayton0",
            "queue": "maap-dps-worker-64gb",
            "LAS": fileurl,
            "Subplot_width": 25,
            "Buffer_width": 10
    }
    
    return job_kwargs

def get_segment_kwargs(fileurl):
    job_kwargs = {
            "identifier": "Mean Shift Segment LAS",
            "algo_id": "MS-Step-2-Segment-v2",
            "version": "main",
            "username": "jclayton0",
            "queue": "maap-dps-cuny-worker-64gb",
            "Point Cloud RDS": fileurl,
            "Subplot_widthFrac_cores": 0.9
    }
    
    return job_kwargs

def get_reconcile_kwargs(tarball_url, las_url):
    job_kwargs = {
            "identifier": "Mean Shift Reconcile LAS",
            "algo_id": "MS-Step-3-Reconcile-v3",
            "version": "main",
            "username": "jclayton0",
            "queue": "maap-dps-worker-64gb",
            "tarball": tarball_url,
            "original_las": las_url
    }
    
    return job_kwargs

def get_merge_kwargs(fileurl):
    job_kwargs = {
            "identifier": "Mean Shift Merge Trees",
            "algo_id": "MS-Step-4-Merge",
            "version": "main",
            "username": "jclayton0",
            "queue": "maap-dps-worker-64gb",
            "segmented_las": fileurl
    }
    
    return job_kwargs
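
The four builders above differ only in their identifier, algorithm ID, queue, and input names. A minimal sketch of a table-driven alternative follows. `COMMON`, `STEP_DEFAULTS`, and `build_job_kwargs` are hypothetical names introduced for illustration, only two of the four steps are shown, and the merge input URL is a placeholder; the remaining values are copied from the functions above.

```python
from typing import Any, Dict

# Settings shared by every step, copied from the builder functions above.
COMMON = {"version": "main", "username": "jclayton0"}

# Per-step settings (hypothetical table; values copied from the functions above).
STEP_DEFAULTS: Dict[str, Dict[str, Any]] = {
    "split": {
        "identifier": "Mean Shift Split LAS",
        "algo_id": "MS-Step-1-Split",
        "queue": "maap-dps-worker-64gb",
        "Subplot_width": 25,
        "Buffer_width": 10,
    },
    "merge": {
        "identifier": "Mean Shift Merge Trees",
        "algo_id": "MS-Step-4-Merge",
        "queue": "maap-dps-worker-64gb",
    },
}


def build_job_kwargs(step: str, **inputs: Any) -> Dict[str, Any]:
    """Merge common settings, step defaults, and per-job inputs into one dict."""
    if step not in STEP_DEFAULTS:
        raise KeyError(f"Unknown step: {step!r}")
    return {**COMMON, **STEP_DEFAULTS[step], **inputs}


split_kwargs = build_job_kwargs(
    "split",
    LAS="s3://maap-ops-workspace/jclayton0/sq_km_tiles_norm/TEAK_large_001.las",
)
# Placeholder URL, not a real output location.
merge_kwargs = build_job_kwargs(
    "merge",
    segmented_las="s3://example-bucket/example_segmented.las",
)
```

Keeping the per-step values in one table makes it easier to spot a mistyped queue or parameter name, at the cost of one more level of indirection.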



In [59]:
def local_url_to_s3(url):
    second = str.split(url,"my-private-bucket")[1]
    full = f"s3://maap-ops-workspace/jclayton0{second}"
    return full

def get_old_jobs_list(old_jobs_path):
    
    with open(old_jobs_path, 'r') as file:
        old_jobs = file.readlines()
    
    old_jobs = [line.strip() for line in old_jobs]
    return old_jobs

def get_succeeded_jobs_in_list(job_ids):
    
    succeeded_job_ids = [job_id for job_id in job_ids
                         if job_status_for(job_id) == "Succeeded"]
    return succeeded_job_ids

def get_failed_jobs_in_list(job_ids):
    
    failed_job_ids = [job_id for job_id in job_ids
                      if job_status_for(job_id) == "Failed"]
    return failed_job_ids

def get_other_status_jobs_in_list(job_ids):

    other_job_ids = [job_id for job_id in job_ids
                     if job_status_for(job_id)
                     not in ["Succeeded", "Failed"]]
    return other_job_ids

def get_unsuccessful_jobs_in_list(job_ids):
    
    unsuccessful_job_ids = [job_id for job_id in job_ids
        if job_status_for(job_id) != "Succeeded"]
    
    return unsuccessful_job_ids

import json
def get_failed_job_input_LAS(failed_json_path):

    with open(failed_json_path,'r') as file:
        data = json.load(file)
        
    return data.get('params').get('job_specification').get('params')[0].get('value')
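
`get_failed_job_input_LAS` assumes that the triaged `_job.json` stores the first job parameter under `params` → `job_specification` → `params[0]` → `value`. The sketch below illustrates that shape with a hypothetical, stripped-down document (a real `_job.json` presumably holds many more fields) and a defensive variant, `first_input_value`, that returns `None` instead of raising when part of the path is missing. Both the sample document and the function name are assumptions made for illustration.

```python
import json
from typing import Optional

# Hypothetical, stripped-down stand-in for a triaged `_job.json`.
# Only the keys that get_failed_job_input_LAS reads are included.
sample_job_json = json.dumps({
    "params": {
        "job_specification": {
            "params": [
                {"value": "s3://maap-ops-workspace/jclayton0/sq_km_tiles_norm/TEAK_large_008.las"}
            ]
        }
    }
})


def first_input_value(job_json_text: str) -> Optional[str]:
    """Return the first job parameter's value, or None if the path is missing."""
    data = json.loads(job_json_text)
    params = data.get("params", {}).get("job_specification", {}).get("params", [])
    if not params:
        return None
    return params[0].get("value")


print(first_input_value(sample_job_json))
print(first_input_value("{}"))
```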



In [4]:
def job_status_for(job_id: str) -> str:
    return maap.getJobStatus(job_id)

def job_result_for(job_id: str) -> str:
    return maap.getJobResult(job_id)[0]

def to_job_output_dir(job_result_url: str, username: str) -> str:
    return (f"/projects/my-private-bucket/"
            f"{job_result_url.split(f'/{username}/')[1]}")

def to_failed_job_params(job_result_url: str) -> str:
    return (f"../triaged-jobs/"
            f"{job_result_url.split(f'/triaged_job/')[1]}/_job.json")


def log_and_print(message: str):
    logging.info(message)
    click.echo(message)

def update_job_states(job_states: Dict[str, str],
                      final_states: List[str],
                      batch_size: int,
                      delay: int) -> int:
    """Update the job states dictionary in place.

    Updating occurs in batches, with a delay in seconds between batches.

    Return the number of jobs updated to final states.
    """
    batch_count = 0
    n_updated_to_final = 0
    for job_id, state in job_states.items():
        if state not in final_states:
            new_state: str = job_status_for(job_id)
            job_states[job_id] = new_state
            if new_state in final_states:
                n_updated_to_final += 1
            batch_count += 1
        # Sleep after each batch to avoid overwhelming the API
        if batch_count == batch_size:
            time.sleep(delay)
            batch_count = 0

    return n_updated_to_final
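
The path helpers `to_job_output_dir` and `local_url_to_s3` are rough inverses: one maps a job result URL to the mounted workspace path, the other maps a mounted path back to an S3 URL. A minimal sketch of the round trip follows. The helpers are restated so the block runs on its own, and the job result URL is hypothetical; the real layout returned by the MAAP API may differ.

```python
USERNAME = "jclayton0"


def to_job_output_dir(job_result_url: str, username: str) -> str:
    # Same logic as the helper above: keep everything after "/<username>/".
    return f"/projects/my-private-bucket/{job_result_url.split(f'/{username}/')[1]}"


def local_url_to_s3(url: str) -> str:
    # Same logic as the helper above: keep everything after "my-private-bucket".
    second = url.split("my-private-bucket")[1]
    return f"s3://maap-ops-workspace/{USERNAME}{second}"


# Hypothetical job result URL, used only to show the mapping.
result_url = "s3://maap-ops-workspace/jclayton0/dps_output/example_job/"
local_dir = to_job_output_dir(result_url, USERNAME)
round_trip = local_url_to_s3(local_dir)

print(local_dir)
print(round_trip)
```

Both helpers index the result of `split` with `[1]`, so they raise `IndexError` if the expected marker (`/<username>/` or `my-private-bucket`) is absent from the input.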



In [5]:
maap.getQueues().json()

{'code': 200,
 'message': 'success',
 'queues': ['maap-dps-sandbox',
  'maap-dps-cuny-worker-64gb',
  'maap-dps-worker-64gb',
  'maap-dps-cuny-worker-512gb',
  'maap-dps-worker-32vcpu-64gb',
  'maap-dps-worker-32gb',
  'maap-dps-worker-8gb',
  'maap-dps-worker-16gb']}

In [5]:
old_jobs_path = "../my-private-bucket/run_output_20241115_131005/job_ids.txt"
old_output_dir = "../my-private-bucket/run_output_20241115_131005/"
old_jobs = get_old_jobs_list(old_jobs_path)
old_jobs

['b2f2322d-4cf9-4b8b-8825-6027004910a0',
 '59446f2e-42fd-4666-a43f-d4a343df4824',
 'd80f1710-9bc0-42c8-a57d-dc9556bb9368',
 'c4c20041-a659-497b-90ac-c1e9f6f0037d',
 '01f47c4c-7e86-456c-b76b-690a3e3a5a2b',
 'c14fc301-3497-4d5d-9794-526ec1d12d67',
 '1d420f6a-2c33-4952-9496-0724e28caa21',
 '62315dc4-6f4d-400d-9369-a5e4ae42e91a',
 'f68564cf-0f7d-4a66-9dff-424da5274366',
 '868e2dab-0118-4999-b6a2-ff860e0877da',
 'ae6a8cea-e1d2-4d4c-8145-d9e3c86ef651',
 '9ab1f156-940b-4e85-afaf-1ea4df03ad12',
 '7ce55e8c-b983-4350-8bfa-0a25961d605e',
 'ea7a32fc-c046-4ce8-9db4-1a32741ac34f',
 '6064599c-7eaf-421d-ae07-5c8e5bc81d59',
 '434b63a7-7d05-4209-b5a6-c02f640e6be4',
 'd6392c13-0fa8-4c48-a823-767fecec6600',
 '4dec228e-d76f-4b61-bfe7-6067949eb44b',
 '6c131c5b-f4b1-4587-b413-eeae1349f8ad',
 '00a36ef2-edaf-4bc9-9236-daee2c2347b4',
 'ca54f470-f35d-4312-b230-7346b71c7744',
 'c48609ba-9427-4cf0-b65e-43463d719211',
 'ff2f2e29-18e3-4ea4-83d1-4cbc4b3c7930',
 '40c94625-038f-4531-aa05-553f3ba9eb8f',
 '837f77d7-aabd-

In [6]:
## Source Dir for large tiles, relative path
long_list = glob.glob('../my-private-bucket/sq_km_tiles_norm/*')
test_list = glob.glob('../my-private-bucket/test-input-sm-tiles/*')

In [7]:
# Keep only the file name from each relative path
files = [os.path.basename(path) for path in long_list]


In [8]:
files

['TEAK_large_001.las',
 'TEAK_large_002.las',
 'TEAK_large_003.las',
 'TEAK_large_004.las',
 'TEAK_large_005.las',
 'TEAK_large_006.las',
 'TEAK_large_007.las',
 'TEAK_large_008.las',
 'TEAK_large_009.las',
 'TEAK_large_010.las',
 'TEAK_large_011.las',
 'TEAK_large_012.las',
 'TEAK_large_013.las',
 'TEAK_large_014.las',
 'TEAK_large_015.las',
 'TEAK_large_016.las',
 'TEAK_large_017.las',
 'TEAK_large_018.las',
 'TEAK_large_019.las',
 'TEAK_large_020.las',
 'TEAK_large_021.las',
 'TEAK_large_022.las',
 'TEAK_large_023.las',
 'TEAK_large_024.las',
 'TEAK_large_025.las',
 'TEAK_large_026.las',
 'TEAK_large_027.las',
 'TEAK_large_028.las',
 'TEAK_large_029.las',
 'TEAK_large_030.las',
 'TEAK_large_031.las',
 'TEAK_large_032.las',
 'TEAK_large_033.las',
 'TEAK_large_034.las',
 'TEAK_large_035.las',
 'TEAK_large_036.las',
 'TEAK_large_037.las',
 'TEAK_large_038.las',
 'TEAK_large_039.las',
 'TEAK_large_040.las',
 'TEAK_large_041.las',
 'TEAK_large_042.las',
 'TEAK_large_043.las',
 'TEAK_larg

In [9]:
test_file = "s3://maap-ops-workspace/jclayton0/normalized/norm_TEAK_047_lidar_2021.las"
url = build_file_url(files[145])

## Get JobIDs from previous submissions

In [73]:
## Get the failed job IDs, find the submission parameter locations,
# and scrape the input file s3 url from the parameter JSON

failed_job_ids = get_failed_jobs_in_list(old_jobs)
failed_job_urls = [job_result_for(job) for job in failed_job_ids]
failed_params = [to_failed_job_params(url) for url in failed_job_urls]
failed_inputs = [get_failed_job_input_LAS(path) for path in failed_params]

failed_inputs

['s3://maap-ops-workspace/jclayton0/sq_km_tiles_norm/TEAK_large_008.las',
 's3://maap-ops-workspace/jclayton0/sq_km_tiles_norm/TEAK_large_009.las',
 's3://maap-ops-workspace/jclayton0/sq_km_tiles_norm/TEAK_large_010.las',
 's3://maap-ops-workspace/jclayton0/sq_km_tiles_norm/TEAK_large_011.las',
 's3://maap-ops-workspace/jclayton0/sq_km_tiles_norm/TEAK_large_012.las',
 's3://maap-ops-workspace/jclayton0/sq_km_tiles_norm/TEAK_large_013.las',
 's3://maap-ops-workspace/jclayton0/sq_km_tiles_norm/TEAK_large_014.las',
 's3://maap-ops-workspace/jclayton0/sq_km_tiles_norm/TEAK_large_055.las',
 's3://maap-ops-workspace/jclayton0/sq_km_tiles_norm/TEAK_large_097.las']

In [77]:
# resubmit_kwargs = [get_split_kwargs(url) for url in failed_inputs]
# resubmit_kwargs

In [None]:
failed_tiles = [tile.split('_norm/')[1] for tile in failed_inputs]
failed_tiles
#save them to a txtfile

## Run Split on all files and collect the outputs

In [10]:
start_time = datetime.datetime.now()

# Set up output directory
output_dir = Path(f"/projects/my-private-bucket/run_output_"
                      f"{start_time.strftime('%Y%m%d_%H%M%S')}")
os.makedirs(output_dir, exist_ok=False)

# Set up log
logging.basicConfig(filename=output_dir / "run.log",
                        level=logging.INFO,
                        format='%(asctime)s - %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S')

log_and_print(f"Starting new model run at MAAP at {start_time}.")



Starting new model run at MAAP at 2024-11-15 13:10:05.350759.


In [78]:
    # Submit jobs for each pair of granules
username = "jclayton0"
job_limit = 1000
check_interval = 90 #seconds between updates

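##### REMOVE this block if doing a new run
start_time = datetime.datetime.now()
log_and_print(f"Starting new model run at MAAP at {start_time}.")
# Wrap in Path so that `output_dir / "job_ids.txt"` below works
output_dir = Path(old_output_dir)
# failed_inputs holds full s3 URLs; build_file_url expects bare file names
files = [url.split('_norm/')[1] for url in failed_inputs]
######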


    if job_limit:
        n_jobs = min(len(files), job_limit)
    else:
        n_jobs = len(files)
    log_and_print(f"Submitting {n_jobs} "
                  f"jobs.")

    job_kwargs_list = []
    for file in files:
        
        #job_kwargs = get_split_kwargs(build_file_url(file))
        job_kwargs = get_split_kwargs(build_file_url(file))

        job_kwargs_list.append(job_kwargs)

    jobs = []
    for job_kwargs in job_kwargs_list[:job_limit]:
        job = maap.submitJob(**job_kwargs)
        jobs.append(job)

    print(f"Submitted {len(jobs)} jobs.")

    job_ids = [job.id for job in jobs]

    # Write job IDs to a file in case processing is interrupted
    job_ids_file = output_dir / "job_ids.txt"
    with open(job_ids_file, 'w') as f:
        for job_id in job_ids:
            f.write(f"{job_id}\n")
    log_and_print(f"Job IDs written to {job_ids_file}")

    # Give the jobs time to start
    click.echo("Waiting for jobs to start...")
    time.sleep(10)

    # Initialize job states
    final_states = ["Succeeded", "Failed", "Deleted"]

    job_states = {job_id: "" for job_id in job_ids}
    update_job_states(job_states, final_states, batch_size=50, delay=10)

    known_completed = len([state for state in job_states.values()
                           if state in final_states])

    while True:
        try:
            with tqdm(total=len(job_ids), desc="Jobs Completed", unit="job") as pbar:
                while any(state not in final_states for state in job_states.values()):

                    # Update the job states
                    n_new_completed: int = update_job_states(job_states,
                                                             final_states,
                                                             batch_size = 50,
                                                             delay = 10)

                    # Update the progress bar
                    pbar.update(n_new_completed)
                    last_updated = datetime.datetime.now()
                    known_completed += n_new_completed
                    
                    status_counts = {status: list(job_states.values()).count(status)
                                     for status in final_states + ["Accepted", "Running"]}
                    status_counts["Other"] = len(job_states) - sum(status_counts.values())
                    status_counts["Last updated"] = last_updated.strftime("%H:%M:%S")

                    pbar.set_postfix(status_counts, refresh=True)

                    if known_completed == len(job_ids):
                        break

                    time.sleep(check_interval)

        except KeyboardInterrupt:
            print("Are you sure you want to cancel the process?")
            print("Press Ctrl+C again to confirm, or wait to continue.")
            try:
                time.sleep(3)
                print("Continuing...")
            except KeyboardInterrupt:
                print("Model run aborted.")
                pending_jobs = [job_id for job_id, state in job_states.items()
                                if state not in final_states]
                click.echo(f"Cancelling {len(pending_jobs)} pending jobs.")
                for job_id in pending_jobs:
                    maap.cancelJob(job_id)
                break
        else:
            break

    # Process the results once all jobs are completed
    succeeded_job_ids = get_succeeded_jobs_in_list(job_ids)
    
    failed_job_ids = get_failed_jobs_in_list(job_ids)

    other_job_ids = get_other_status_jobs_in_list(job_ids) 

    click.echo(f"Processing results for {len(succeeded_job_ids)} "
               f"succeeded jobs.")

    click.echo(f"Gathering tarball paths from succeeded jobs.")

    tar_paths = []
    for job_id in tqdm(succeeded_job_ids):
        job_result_url = job_result_for(job_id)
        time.sleep(1) # to avoid overwhelming the API
        job_output_dir = to_job_output_dir(job_result_url, username)
        # Find .tar.gz file in the output dir
        tar_file = [f for f in os.listdir(job_output_dir)
                     if f.endswith('.tar.gz')]
        if len(tar_file) > 1:
            warnings.warn(f"Multiple .tar.gz files found in "
                          f"{job_output_dir}.")
        if len(tar_file) == 0:
            warnings.warn(f"No .tar.gz files found in "
                          f"{job_output_dir}.")
        if tar_file:
            tar_paths.append(os.path.join(job_output_dir, tar_file[0]))

    # Log the succeeded and failed job IDs
    logging.info(f"{len(succeeded_job_ids)} jobs succeeded.")
    logging.info(f"Succeeded job IDs: {succeeded_job_ids}\n")
    logging.info(f"{len(failed_job_ids)} jobs failed.")
    logging.info(f"Failed job IDs: {failed_job_ids}\n")
    logging.info(f"{len(other_job_ids)} jobs in other states.")
    logging.info(f"Other job IDs: {other_job_ids}\n")

    # Copy all tarballs to the output directory
    click.echo(f"Copying {len(tar_paths)} Tarballs to {output_dir}.")
    copy_batch_count = 0
    for tar_path in tqdm(tar_paths):
        try:
            shutil.copy(tar_path, output_dir)
            copy_batch_count += 1
            if copy_batch_count == 50:
                time.sleep(60)
                copy_batch_count = 0
            else:
                time.sleep(2)
            
        except Exception as e:
            warnings.warn(f"Error copying {tar_path} to {output_dir}: {str(e)}")
            click.echo("Retrying in 10 seconds.")
            time.sleep(10)
            try:
                shutil.copy(tar_path, output_dir)
            except Exception as e:
                click.echo(f"Retry failed: {str(e)}")
                click.echo(f"Skipping {tar_path}.")
                continue

    # Compress the output directory
    # click.echo(f"Compressing output directory.")
    # shutil.make_archive(output_dir, 'zip', output_dir)
    # click.echo(f"Output directory compressed to {output_dir}.zip.")

    end_time = datetime.datetime.now()

    log_and_print(f"Model run completed at {end_time}.")
    
# if __name__ == "__main__":
#     main()

Starting new model run at MAAP at 2024-11-19 15:29:05.771577.
Submitting 9 jobs.
Submitted 9 jobs.


TypeError: unsupported operand type(s) for /: 'str' and 'str'
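
The `TypeError` above is what the `/` operator raises when both operands are plain strings, as happens when `output_dir` holds a string such as `old_output_dir` instead of a `Path`. A minimal sketch reproducing the error and showing the `pathlib` form that works:

```python
from pathlib import Path

old_output_dir = "../my-private-bucket/run_output_20241115_131005/"

# str / str is not defined, which reproduces the error above.
try:
    old_output_dir / "job_ids.txt"
    raised = False
except TypeError:
    raised = True

# Wrapping the string in Path makes the same expression work.
job_ids_file = Path(old_output_dir) / "job_ids.txt"

print(raised, job_ids_file.name)
```

Later cells build paths with string concatenation (`output_dir + 'job_ids/'`), which works for strings but not for `Path` objects, so it helps to pick one style per cell.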

In [85]:
new_job_ids = job_ids
new_job_ids

['a39f19f7-b9dc-4f81-aaf9-d2a7d7d4bf0b',
 '43bb3357-a447-41a6-8d18-2c24947e349e',
 '886c04df-acbc-4e3e-8807-356951c86da9',
 '0e238056-d343-4924-b8ff-7e04d0fb7224',
 '7af49420-7269-4955-b6f9-2565849d8be1',
 '96001cf0-2a46-469f-9f40-d7335e8823d5',
 'cf180c75-0169-4c11-a8dc-96ae5bb1d4be',
 '7737a35c-5724-40b4-a352-b479b14df366',
 '668f99ba-5fca-4980-a82b-276d06681ca5']

In [6]:
# comment out if doing new stuff
job_ids = old_jobs
output_dir = old_output_dir
username = "jclayton0"

# Process the results once all jobs are completed
succeeded_job_ids = [job_id for job_id in job_ids
                     if job_status_for(job_id) == "Succeeded"]

failed_job_ids = [job_id for job_id in job_ids
                  if job_status_for(job_id) == "Failed"]

other_job_ids = [job_id for job_id in job_ids
                 if job_status_for(job_id)
                 not in ["Succeeded", "Failed"]]

click.echo(f"Processing results for {len(succeeded_job_ids)} "
           f"succeeded jobs.")

click.echo(f"Gathering tarball paths from succeeded jobs.")

tar_paths = []
for job_id in tqdm(succeeded_job_ids):
    job_result_url = job_result_for(job_id)
    time.sleep(1) # to avoid overwhelming the API
    job_output_dir = to_job_output_dir(job_result_url, username)
    # Find .tar.gz file in the output dir
    tar_file = [f for f in os.listdir(job_output_dir)
                 if f.endswith('.tar.gz')]
    if len(tar_file) > 1:
        warnings.warn(f"Multiple .tar.gz files found in "
                      f"{job_output_dir}.")
    if len(tar_file) == 0:
        warnings.warn(f"No .tar.gz files found in "
                      f"{job_output_dir}.")
    if tar_file:
        tar_paths.append(os.path.join(job_output_dir, tar_file[0]))

# Log the succeeded and failed job IDs
logging.info(f"{len(succeeded_job_ids)} jobs succeeded.")
logging.info(f"Succeeded job IDs: {succeeded_job_ids}\n")
logging.info(f"{len(failed_job_ids)} jobs failed.")
logging.info(f"Failed job IDs: {failed_job_ids}\n")
logging.info(f"{len(other_job_ids)} jobs in other states.")
logging.info(f"Other job IDs: {other_job_ids}\n")

Processing results for 239 succeeded jobs.
Gathering tarball paths from succeeded jobs.


100%|██████████| 239/239 [04:49<00:00,  1.21s/it]
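
The cell above filters the same job list three times, so each job's status may be requested up to three times, and a job can change state between passes. A minimal sketch of a single-pass alternative follows. `group_jobs_by_status` is a hypothetical helper, and a dictionary of made-up statuses stands in for `job_status_for` so the block runs without the MAAP API.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_jobs_by_status(job_ids: Iterable[str],
                         status_for: Callable[[str], str]) -> Dict[str, List[str]]:
    """Query each job's status once and bucket the IDs by status."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for job_id in job_ids:
        groups[status_for(job_id)].append(job_id)
    return dict(groups)


# Stand-in for job_status_for; these IDs and statuses are made up.
fake_statuses = {
    "job-a": "Succeeded",
    "job-b": "Failed",
    "job-c": "Running",
    "job-d": "Succeeded",
}

groups = group_jobs_by_status(fake_statuses, fake_statuses.get)
succeeded = groups.get("Succeeded", [])
failed = groups.get("Failed", [])
other = [job_id
         for status, ids in groups.items()
         if status not in ("Succeeded", "Failed")
         for job_id in ids]
```

In the notebook, `job_status_for` would be passed as `status_for`.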


In [9]:
## pickle tarpaths 

import pickle

# with open('tar_paths_step_1.pkl', 'wb') as f:
#    pickle.dump(tar_paths, f)
with open('tar_paths_step_1.pkl', 'rb') as f:
    tar_paths = pickle.load(f)


In [15]:
tar_paths = tar_paths[-18:]

In [16]:
output_dir = old_output_dir
outdir_files = os.listdir(output_dir)
# Copy all tarballs to the output directory
click.echo(f"Copying {len(tar_paths)} Tarballs to {output_dir}.")
copy_batch_count = 0
for tar_path in tqdm(tar_paths):
    fname = tar_path.split('/')[-1]
    if fname in outdir_files:
        copy_batch_count +=1
        continue
    else:
        try:
            shutil.copy(tar_path, output_dir)
            copy_batch_count += 1
            if copy_batch_count == 10:
                time.sleep(60)
                copy_batch_count = 0
            else:
                time.sleep(2)
            
        except Exception as e:
            warnings.warn(f"Error copying {tar_path} to {output_dir}: {str(e)}")
            click.echo("Retrying in 10 seconds.")
            time.sleep(10)
            try:
                shutil.copy(tar_path, output_dir)
            except Exception as e:
                click.echo(f"Retry failed: {str(e)}")
                click.echo(f"Skipping {tar_path}.")
                continue


# Compress the output directory
# click.echo(f"Compressing output directory.")
# shutil.make_archive(output_dir, 'zip', output_dir)
# click.echo(f"Output directory compressed to {output_dir}.zip.")

end_time = datetime.datetime.now()

log_and_print(f"Model run completed at {end_time}.")


Copying 18 Tarballs to ../my-private-bucket/run_output_20241115_131005/.


100%|██████████| 18/18 [02:16<00:00,  7.56s/it]

Model run completed at 2024-11-22 15:07:52.852485.





## Above, every file in the directory is split by submitting a Split job, and the resulting tarballs are gathered into a single directory.

## The copy step has to be run several times. In the terminal, also run

`ls -lhS | tac` 

## and remove any files with a size of 0; they need to be re-transferred.



## Then run extract_tarballs.sh after modifying it to include the correct directory path.

## Extraction will take some time.
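
The manual check for zero-size files described above can also be done from Python. The sketch below is one possible approach, not part of the original workflow: `find_empty_files` is a hypothetical helper, and the demonstration uses a throwaway directory with made-up file names.

```python
import tempfile
from pathlib import Path
from typing import List


def find_empty_files(directory: Path, pattern: str = "*.tar.gz") -> List[Path]:
    """Return files matching `pattern` whose size on disk is 0 bytes."""
    return sorted(p for p in directory.glob(pattern)
                  if p.is_file() and p.stat().st_size == 0)


# Demonstration in a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "TEAK_large_001.tar.gz").write_bytes(b"not empty")
    (tmp_dir / "TEAK_large_002.tar.gz").write_bytes(b"")
    empty_names = [p.name for p in find_empty_files(tmp_dir)]

print(empty_names)
```

Empty files found this way could be deleted with `Path.unlink()` before re-running the copy cell, which skips files already present in the output directory.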

In [38]:
# Directory containing your files
output_dir = old_output_dir

import os
import pickle
# Specify the directory to scan


# List directories that start with 'TEAK'
teak_directories = [
    name for name in os.listdir(output_dir) 
    if os.path.isdir(os.path.join(output_dir, name)) and name.startswith('TEAK')
]

print("Directories starting with 'TEAK':", len(teak_directories))

#path = teak_directories[19]
print(teak_directories[221:])

Directories starting with 'TEAK': 239
['TEAK_large_231', 'TEAK_large_232', 'TEAK_large_233', 'TEAK_large_234', 'TEAK_large_235', 'TEAK_large_237', 'TEAK_large_238', 'TEAK_large_239', 'TEAK_large_240', 'TEAK_large_241', 'TEAK_large_242', 'TEAK_large_243', 'TEAK_large_244', 'TEAK_large_245', 'TEAK_large_246', 'TEAK_large_247', 'TEAK_large_248', 'TEAK_large_249']


In [39]:
for path in teak_directories[221:]:
    print(f"making kwargs for {path}")
    tiles_path = output_dir + path + '/'
    kwargs_file = 'kwargs/' + path + '.pkl'
    tiles = [name for name in os.listdir(tiles_path)
             if os.path.isfile(os.path.join(tiles_path, name)) and name.endswith('.rds')]
    
    kwargs_list = []
    for tile in tiles:
        tile_url = tiles_path + tile
        url = local_url_to_s3(tile_url)
        kwargs = get_segment_kwargs(url)
        kwargs_list.append(kwargs)
    
    with open(kwargs_file, 'wb') as f:
        pickle.dump(kwargs_list, f)
    
    print(f"{len(kwargs_list)} kwargs written to {kwargs_file}")
    
#print(f"first item: {kwargs_list[0]}")

making kwargs for TEAK_large_231
1600 kwargs written to kwargs/TEAK_large_231.pkl
making kwargs for TEAK_large_232
1197 kwargs written to kwargs/TEAK_large_232.pkl
making kwargs for TEAK_large_233
201 kwargs written to kwargs/TEAK_large_233.pkl
making kwargs for TEAK_large_234
19 kwargs written to kwargs/TEAK_large_234.pkl
making kwargs for TEAK_large_235
5 kwargs written to kwargs/TEAK_large_235.pkl
making kwargs for TEAK_large_237
18 kwargs written to kwargs/TEAK_large_237.pkl
making kwargs for TEAK_large_238
83 kwargs written to kwargs/TEAK_large_238.pkl
making kwargs for TEAK_large_239
14 kwargs written to kwargs/TEAK_large_239.pkl
making kwargs for TEAK_large_240
57 kwargs written to kwargs/TEAK_large_240.pkl
making kwargs for TEAK_large_241
169 kwargs written to kwargs/TEAK_large_241.pkl
making kwargs for TEAK_large_242
136 kwargs written to kwargs/TEAK_large_242.pkl
making kwargs for TEAK_large_243
133 kwargs written to kwargs/TEAK_large_243.pkl
making kwargs for TEAK_large_244


In [42]:
kwargs_files = [name for name in os.listdir('kwargs/') if name.startswith('TEAK')]
kwargs_tilenames = [name.split('.')[0] for name in kwargs_files]
len(kwargs_tilenames)

239

In [43]:
output_dir = old_output_dir

job_ids_dir = output_dir + 'job_ids/'
submitted_jobs = [name for name in os.listdir(job_ids_dir)  if os.path.isfile(os.path.join(job_ids_dir, name))]
submitted_jobs_tilenames = [name.split('_job')[0] for name in submitted_jobs]
submitted_jobs_tilenames

['TEAK_large_001',
 'TEAK_large_002',
 'TEAK_large_003',
 'TEAK_large_004',
 'TEAK_large_005',
 'TEAK_large_006',
 'TEAK_large_007',
 'TEAK_large_015',
 'TEAK_large_016',
 'TEAK_large_017',
 'TEAK_large_018',
 'TEAK_large_019',
 'TEAK_large_020',
 'TEAK_large_021',
 'TEAK_large_022',
 'TEAK_large_023',
 'TEAK_large_024',
 'TEAK_large_025',
 'TEAK_large_026',
 'TEAK_large_027',
 'TEAK_large_028',
 'TEAK_large_029',
 'TEAK_large_030',
 'TEAK_large_031',
 'TEAK_large_032',
 'TEAK_large_033',
 'TEAK_large_034',
 'TEAK_large_035',
 'TEAK_large_036',
 'TEAK_large_037',
 'TEAK_large_038',
 'TEAK_large_039',
 'TEAK_large_040',
 'TEAK_large_041',
 'TEAK_large_042',
 'TEAK_large_043',
 'TEAK_large_044',
 'TEAK_large_045',
 'TEAK_large_046',
 'TEAK_large_047',
 'TEAK_large_048',
 'TEAK_large_049',
 'TEAK_large_050',
 'TEAK_large_051',
 'TEAK_large_052',
 'TEAK_large_053',
 'TEAK_large_054',
 'TEAK_large_056',
 'TEAK_large_057',
 'TEAK_large_058',
 'TEAK_large_059',
 'TEAK_large_060',
 'TEAK_large

In [44]:
len(submitted_jobs)

221
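
The two counts above (239 kwargs files, 221 job-ID files) suggest that 18 tiles have not been submitted yet. A minimal sketch that finds the unsubmitted tiles by name instead of by count is shown below. `pending_tiles` is a hypothetical helper, and the short lists are stand-ins for `kwargs_tilenames` and `submitted_jobs_tilenames`.

```python
def pending_tiles(all_tiles, submitted_tiles):
    """Return tiles with no job-ID file yet, preserving the order of all_tiles."""
    submitted = set(submitted_tiles)
    return [tile for tile in all_tiles if tile not in submitted]


# Hypothetical stand-ins for kwargs_tilenames and submitted_jobs_tilenames.
kwargs_tilenames = ["TEAK_large_001", "TEAK_large_002", "TEAK_large_003", "TEAK_large_004"]
submitted_jobs_tilenames = ["TEAK_large_001", "TEAK_large_003"]

todo = pending_tiles(kwargs_tilenames, submitted_jobs_tilenames)
print(todo)
```

Comparing names still works when a tile in the middle of the sequence was skipped, a case that slicing by count would miss.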

In [45]:
import re
# Sort the files by their tile number

# \d matches a digit; + matches one or more consecutive digits
# This only works because each file name contains a single integer
def extract_integer(filename):
    match = re.search(r'\d+', filename)
    return int(match.group()) if match else 0  # Return 0 if no match found

# Sort the list using the extracted integers as the key
sorted_filenames = sorted(kwargs_files, key=extract_integer)

sorted_filenames[221:]

['TEAK_large_231.pkl',
 'TEAK_large_232.pkl',
 'TEAK_large_233.pkl',
 'TEAK_large_234.pkl',
 'TEAK_large_235.pkl',
 'TEAK_large_237.pkl',
 'TEAK_large_238.pkl',
 'TEAK_large_239.pkl',
 'TEAK_large_240.pkl',
 'TEAK_large_241.pkl',
 'TEAK_large_242.pkl',
 'TEAK_large_243.pkl',
 'TEAK_large_244.pkl',
 'TEAK_large_245.pkl',
 'TEAK_large_246.pkl',
 'TEAK_large_247.pkl',
 'TEAK_large_248.pkl',
 'TEAK_large_249.pkl']
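
Sorting on the extracted integer matters most when the numbers are not zero-padded. The tile names in this notebook are padded to three digits, so plain string sorting would give the same order here; the sketch below uses hypothetical unpadded names to show where the two orders diverge. `extract_integer` is restated so the block runs on its own.

```python
import re


def extract_integer(filename: str) -> int:
    # Same helper as above: first run of digits, or 0 if there is none.
    match = re.search(r'\d+', filename)
    return int(match.group()) if match else 0


# Hypothetical, unpadded names to show the difference.
names = ["tile_10.pkl", "tile_2.pkl", "tile_1.pkl"]
lexicographic = sorted(names)
numeric = sorted(names, key=extract_integer)

print(lexicographic)
print(numeric)
```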

In [46]:
fname = 'step2_submit_from_' + str(len(submitted_jobs)) + '.log'


In [47]:
print(fname)

step2_submit_from_221.log


In [48]:
# Submit jobs for each tile
username = "jclayton0"
# job_limit = 1000
check_interval = 90 #seconds between updates

##### REMOVE this block if doing new run
start_time = datetime.datetime.now()
# force=True (Python 3.8+) replaces any handlers set by an earlier basicConfig call
logging.basicConfig(filename=fname,
                        level=logging.INFO,
                        format='%(asctime)s - %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S',
                        force=True)
log_and_print(f"Starting new model run at MAAP at {start_time}.")
output_dir = old_output_dir 
# files = failed_inputs
######
## pick up where we left off: don't forget to make sure the newest file in that dir successfully completed.
tile_indexer = len(submitted_jobs)

interval_time = datetime.datetime.now()
kwargs_files = [name for name in os.listdir('kwargs/') if name.startswith('TEAK')]
#kwargs_files = kwargs_files[:2]
kwargs_files = sorted_filenames = sorted(kwargs_files, key=extract_integer)
kwargs_files = kwargs_files[tile_indexer:]

njobs_total = 0
for file in kwargs_files:
    filepath = 'kwargs/' + file
    with open(filepath, 'rb') as f:
        job_kwargs_list = pickle.load(f)

    job_ids_file = file.split('.')[0] + '_job_ids.pkl'
    job_ids_file = output_dir + "job_ids/" + job_ids_file
    jobs = []
    for job_kwargs in job_kwargs_list:
        job = maap.submitJob(**job_kwargs)
        jobs.append(job)

    njobs_batch = len(jobs)
    print(f"Submitted {njobs_batch} jobs for {file}")

    job_ids = [job.id for job in jobs]

    # Write job IDs to a file in case processing is interrupted
    with open(job_ids_file, 'wb') as f:
        pickle.dump(job_ids, f)
        
    log_and_print(f"Job IDs written to {job_ids_file}")
    njobs_total += njobs_batch
    curr_time = datetime.datetime.now()

    batch_interval = curr_time - interval_time
    total_interval = curr_time - start_time

    batch_interval = batch_interval.total_seconds()
    total_interval = total_interval.total_seconds()

    batch_avg = njobs_batch / batch_interval
    total_avg = njobs_total / total_interval

    log_and_print(f"submitted {njobs_batch} in {batch_interval} seconds, {batch_avg} jobs/s")
    log_and_print(f"submitted {njobs_total} in total in {total_interval} seconds, {total_avg} jobs/s")
    interval_time = curr_time



Starting new model run at MAAP at 2024-11-22 17:35:49.869633.
Submitted 1600 jobs for TEAK_large_231.pkl
Job IDs written to ../my-private-bucket/run_output_20241115_131005/job_ids/TEAK_large_231_job_ids.pkl
submitted 1600 in 175.156437 seconds, 9.134691407316078 s/job
submitted 1600 in total in 175.157359 seconds, 9.134643323778363 s/job
Submitted 1197 jobs for TEAK_large_232.pkl
Job IDs written to ../my-private-bucket/run_output_20241115_131005/job_ids/TEAK_large_232_job_ids.pkl
submitted 1197 in 130.613697 seconds, 9.164429363024615 s/job
submitted 2797 in total in 305.771056 seconds, 9.147366780196489 s/job
Submitted 201 jobs for TEAK_large_233.pkl
Job IDs written to ../my-private-bucket/run_output_20241115_131005/job_ids/TEAK_large_233_job_ids.pkl
submitted 201 in 21.861098 seconds, 9.194414662978046 s/job
submitted 2998 in total in 327.632154 seconds, 9.15050602756163 s/job
Submitted 19 jobs for TEAK_large_234.pkl
Job IDs written to ../my-private-bucket/run_output_20241115_131005/
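
The logged figure of about 9.13 is the number of jobs divided by the elapsed seconds, which is a rate in jobs per second; seconds per job is the reciprocal. A worked check using the first logged batch:

```python
njobs_batch = 1600
batch_interval = 175.156437  # seconds, from the first logged batch

jobs_per_second = njobs_batch / batch_interval
seconds_per_job = batch_interval / njobs_batch

print(f"{jobs_per_second:.2f} jobs/s, {seconds_per_job:.3f} s/job")
```

At roughly 0.11 seconds per submission, the 1600 jobs for a single large tile take about three minutes to submit.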

In [None]:
## Need to get the failed jobs, remove them from job IDs, 
## save successful job paths to a file
## 
output_dir = old_output_dir

job_ids_dir = output_dir + 'job_ids/'
# Sort so that slicing by count below is deterministic (os.listdir order is arbitrary)
submitted_jobs = sorted(name for name in os.listdir(job_ids_dir) if os.path.isfile(os.path.join(job_ids_dir, name)))
# submitted_jobs_tilenames = [name.split('_job')[0] for name in submitted_jobs]
# submitted_jobs_tilenames
gathered_paths = [name for name in os.listdir('seg_out_paths/')]

resubmitted_jobs = []

for file in submitted_jobs[len(gathered_paths):]:
    filepath = job_ids_dir + file
    with open(filepath, 'rb') as f:
        job_ids = pickle.load(f)

    total_jobs = len(job_ids)
    #succeeded = get_succeeded_jobs_in_list(job_ids)
    unsuccessful = get_unsuccessful_jobs_in_list(job_ids)
   # other_status = total_jobs - len(succeeded) - len(failed) #get_other_status_jobs_in_list(job_ids)
    for job in unsuccessful:
        job_ids.remove(job)

    rds_paths = []
    for job_id in tqdm(job_ids):
        job_result_url = job_result_for(job_id)
       # time.sleep(1) # to avoid overwhelming the API
        job_output_dir = to_job_output_dir(job_result_url, username)
        # Find the .rds file in the output dir
        rds_file = [f for f in os.listdir(job_output_dir)
                     if f.endswith('.rds')]
        if rds_file:
            rds_paths.append(os.path.join(job_output_dir, rds_file[0]))
        
    tilename = file.split('_job')[0]

    output_paths_file = 'seg_out_paths/' + tilename + 'output_paths.pkl'

    with open(output_paths_file, 'wb') as f:
        pickle.dump(rds_paths, f)

    failed_job_urls = [job_result_for(job) for job in unsuccessful]
    failed_params = [to_failed_job_params(url) for url in failed_job_urls]
    failed_inputs = [get_failed_job_input_LAS(path) for path in failed_params]

    failed_inputs_file = 'failed_inputs/' + tilename + '_failed.pkl'

    with open(failed_inputs_file, 'wb') as f:
        pickle.dump(failed_inputs, f)

    
    print(f"saved {len(rds_paths)} output paths for {tilename} to {output_paths_file} and {len(failed_inputs)} failed filepaths to {failed_inputs_file}")

    
    # if(total_jobs == 0):
    #     print(f"No jobs for {file.split('_job')[0]}")
    #     continue;
    
    # print(f"Of {total_jobs} jobs in {file.split('_job')[0]}, {len(succeeded)/total_jobs}% succeeded, {len(failed)/total_jobs}% failed, and {other_status} jobs had other statuses")   


In [75]:
# gathered_paths = [name for name in os.listdir('seg_out_paths/')]
# submitted_jobs[len(gathered_paths):]

['TEAK_large_020_job_ids.pkl',
 'TEAK_large_021_job_ids.pkl',
 'TEAK_large_022_job_ids.pkl',
 'TEAK_large_023_job_ids.pkl',
 'TEAK_large_024_job_ids.pkl',
 'TEAK_large_025_job_ids.pkl',
 'TEAK_large_026_job_ids.pkl',
 'TEAK_large_027_job_ids.pkl',
 'TEAK_large_028_job_ids.pkl',
 'TEAK_large_029_job_ids.pkl',
 'TEAK_large_030_job_ids.pkl',
 'TEAK_large_031_job_ids.pkl',
 'TEAK_large_032_job_ids.pkl',
 'TEAK_large_033_job_ids.pkl',
 'TEAK_large_034_job_ids.pkl',
 'TEAK_large_035_job_ids.pkl',
 'TEAK_large_036_job_ids.pkl',
 'TEAK_large_037_job_ids.pkl',
 'TEAK_large_038_job_ids.pkl',
 'TEAK_large_039_job_ids.pkl',
 'TEAK_large_040_job_ids.pkl',
 'TEAK_large_041_job_ids.pkl',
 'TEAK_large_042_job_ids.pkl',
 'TEAK_large_043_job_ids.pkl',
 'TEAK_large_044_job_ids.pkl',
 'TEAK_large_045_job_ids.pkl',
 'TEAK_large_046_job_ids.pkl',
 'TEAK_large_047_job_ids.pkl',
 'TEAK_large_048_job_ids.pkl',
 'TEAK_large_049_job_ids.pkl',
 'TEAK_large_050_job_ids.pkl',
 'TEAK_large_051_job_ids.pkl',
 'TEAK_l

In [64]:
filepath = job_ids_dir + submitted_jobs[20]
with open(filepath, 'rb') as f:
    job_ids = pickle.load(f)

unsuc = get_unsuccessful_jobs_in_list(job_ids)
print(len(unsuc))




failed_job_urls = [job_result_for(job) for job in unsuc]
failed_params = [to_failed_job_params(url) for url in failed_job_urls]
failed_inputs = [get_failed_job_input_LAS(json) for json in failed_params]

failed_inputs

## We only want to check all the jobs once, so we will make a list of every single failed job
## and store them as a dict, with the key being the S3 URL and the value being the job ID.
## Then, after we make and save the failed jobs dict,
## iterate over the dict and make kwargs from each key.

5


['s3://maap-ops-workspace/jclayton0/run_output_20241115_131005/TEAK_large_028/TEAK_large_028_pc_1400.rds',
 's3://maap-ops-workspace/jclayton0/run_output_20241115_131005/TEAK_large_028/TEAK_large_028_pc_1494.rds',
 's3://maap-ops-workspace/jclayton0/run_output_20241115_131005/TEAK_large_028/TEAK_large_028_pc_716.rds',
 's3://maap-ops-workspace/jclayton0/run_output_20241115_131005/TEAK_large_028/TEAK_large_028_pc_97.rds',
 's3://maap-ops-workspace/jclayton0/run_output_20241115_131005/TEAK_large_028/TEAK_large_028_pc_974.rds']
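
The plan in the comments of this cell could be sketched roughly as follows. This is a minimal sketch, not the notebook's actual implementation: `build_failed_jobs_dict` and `make_resubmit_kwargs` are hypothetical helper names, and the job IDs and URLs are made-up placeholders. The template reuses a few of the segment-job kwargs fields shown in this notebook. The job-ID-to-input-URL lookup is passed in as a plain function, so no MAAP calls are assumed.

```python
from typing import Callable, Dict, Iterable, List


def build_failed_jobs_dict(
    failed_job_ids: Iterable[str],
    input_url_for: Callable[[str], str],
) -> Dict[str, str]:
    """Map each failed job's input S3 URL to its job ID.

    `input_url_for` is a hypothetical stand-in for the chain
    job_result_for -> to_failed_job_params -> get_failed_job_input_LAS.
    """
    return {input_url_for(job_id): job_id for job_id in failed_job_ids}


def make_resubmit_kwargs(failed_jobs: Dict[str, str], template: dict) -> List[dict]:
    """Build one kwargs dict per failed input URL from a shared template."""
    return [{**template, "Point Cloud RDS": url} for url in failed_jobs]


# Illustrative values only: these job IDs and URLs are placeholders.
fake_inputs = {
    "job-a": "s3://example-bucket/tile_001/tile_001_pc_1.rds",
    "job-b": "s3://example-bucket/tile_001/tile_001_pc_2.rds",
}
template = {
    "identifier": "Mean Shift Segment LAS",
    "algo_id": "MS-Step-2-Segment-v2",
    "version": "main",
}

failed_jobs = build_failed_jobs_dict(fake_inputs.keys(), fake_inputs.__getitem__)
resubmit_kwargs = make_resubmit_kwargs(failed_jobs, template)
print(f"{len(resubmit_kwargs)} jobs to resubmit")
```

Keying the dict by input URL means a tile that fails twice is only queued once, and the saved dict can be pickled like the other intermediate lists here.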

In [None]:
get_f

In [None]:

    # Initialize job states
    final_states = ["Succeeded", "Failed", "Deleted"]

    job_states = {job_id: "" for job_id in job_ids}
    update_job_states(job_states, final_states, batch_size=50, delay=10)

    known_completed = len([state for state in job_states.values()
                           if state in final_states])

    while True:
        try:
            # Start the bar at the number of jobs already known to be finished
            with tqdm(total=len(job_ids), initial=known_completed,
                      desc="Jobs Completed", unit="job") as pbar:
                while any(state not in final_states for state in job_states.values()):

                    # Update the job states
                    n_new_completed: int = update_job_states(job_states,
                                                             final_states,
                                                             batch_size = 50,
                                                             delay = 10)

                    # Update the progress bar
                    pbar.update(n_new_completed)
                    last_updated = datetime.datetime.now()
                    known_completed += n_new_completed
                    
                    status_counts = {status: list(job_states.values()).count(status)
                                     for status in final_states + ["Accepted", "Running"]}
                    status_counts["Other"] = len(job_states) - sum(status_counts.values())
                    status_counts["Last updated"] = last_updated.strftime("%H:%M:%S")

                    pbar.set_postfix(status_counts, refresh=True)

                    if known_completed == len(job_ids):
                        break

                    time.sleep(check_interval)

        except KeyboardInterrupt:
            print("Are you sure you want to cancel the process?")
            print("Press Ctrl+C again to confirm, or wait to continue.")
            try:
                time.sleep(3)
                print("Continuing...")
            except KeyboardInterrupt:
                print("Model run aborted.")
                pending_jobs = [job_id for job_id, state in job_states.items()
                                if state not in final_states]
                click.echo(f"Cancelling {len(pending_jobs)} pending jobs.")
                for job_id in pending_jobs:
                    maap.cancelJob(job_id)
                break
        else:
            break

    # Process the results once all jobs are completed
    succeeded_job_ids = get_succeeded_jobs_in_list(job_ids)
    
    failed_job_ids = get_failed_jobs_in_list(job_ids)

    other_job_ids = get_other_status_jobs_in_list(job_ids) 

    click.echo(f"Processing results for {len(succeeded_job_ids)} "
               f"succeeded jobs.")

    click.echo("Gathering tarball paths from succeeded jobs.")

    tar_paths = []
    for job_id in tqdm(succeeded_job_ids):
        job_result_url = job_result_for(job_id)
        time.sleep(1) # to avoid overwhelming the API
        job_output_dir = to_job_output_dir(job_result_url, username)
        # Find .tar.gz file in the output dir
        tar_file = [f for f in os.listdir(job_output_dir)
                     if f.endswith('.tar.gz')]
        if len(tar_file) > 1:
            warnings.warn(f"Multiple .tar.gz files found in "
                          f"{job_output_dir}.")
        if len(tar_file) == 0:
            warnings.warn(f"No .tar.gz files found in "
                          f"{job_output_dir}.")
        if tar_file:
            tar_paths.append(os.path.join(job_output_dir, tar_file[0]))

    # Log the succeeded and failed job IDs
    logging.info(f"{len(succeeded_job_ids)} jobs succeeded.")
    logging.info(f"Succeeded job IDs: {succeeded_job_ids}\n")
    logging.info(f"{len(failed_job_ids)} jobs failed.")
    logging.info(f"Failed job IDs: {failed_job_ids}\n")
    logging.info(f"{len(other_job_ids)} jobs in other states.")
    logging.info(f"Other job IDs: {other_job_ids}\n")

    # Copy all tarballs to the output directory
    click.echo(f"Copying {len(tar_paths)} Tarballs to {output_dir}.")
    copy_batch_count = 0
    for tar_path in tqdm(tar_paths):
        try:
            shutil.copy(tar_path, output_dir)
            copy_batch_count += 1
            if copy_batch_count == 50:
                time.sleep(60)
                copy_batch_count = 0
            else:
                time.sleep(2)
            
        except Exception as e:
            warnings.warn(f"Error copying {tar_path} to {output_dir}: {str(e)}")
            click.echo("Retrying in 10 seconds.")
            time.sleep(10)
            try:
                shutil.copy(tar_path, output_dir)
            except Exception as e:
                click.echo(f"Retry failed: {str(e)}")
                click.echo(f"Skipping {tar_path}.")
                continue

    # Compress the output directory
    # click.echo(f"Compressing output directory.")
    # shutil.make_archive(output_dir, 'zip', output_dir)
    # click.echo(f"Output directory compressed to {output_dir}.zip.")

    end_time = datetime.datetime.now()

    log_and_print(f"Model run completed at {end_time}.")
    
# if __name__ == "__main__":
#     main()

In [21]:
with open(kwargs_file, 'rb') as f:
    kwargs = pickle.load(f)


[{'identifier': 'Mean Shift Segment LAS',
  'algo_id': 'MS-Step-2-Segment-v2',
  'version': 'main',
  'username': 'jclayton0',
  'queue': 'maap-dps-cuny-worker-64gb',
  'Point Cloud RDS': 's3://maap-ops-workspace/jclayton0/run_output_20241115_131005/TEAK_large_027/TEAK_large_027_pc_1.rds',
  'Subplot_widthFrac_cores': 0.9},
 {'identifier': 'Mean Shift Segment LAS',
  'algo_id': 'MS-Step-2-Segment-v2',
  'version': 'main',
  'username': 'jclayton0',
  'queue': 'maap-dps-cuny-worker-64gb',
  'Point Cloud RDS': 's3://maap-ops-workspace/jclayton0/run_output_20241115_131005/TEAK_large_027/TEAK_large_027_pc_10.rds',
  'Subplot_widthFrac_cores': 0.9},
 {'identifier': 'Mean Shift Segment LAS',
  'algo_id': 'MS-Step-2-Segment-v2',
  'version': 'main',
  'username': 'jclayton0',
  'queue': 'maap-dps-cuny-worker-64gb',
  'Point Cloud RDS': 's3://maap-ops-workspace/jclayton0/run_output_20241115_131005/TEAK_large_027/TEAK_large_027_pc_100.rds',
  'Subplot_widthFrac_cores': 0.9},
 {'identifier': 'Me

## What is gone:

* Code to make the kwargs from the directory
* Code to save the kwargs as pickles
* Code to read the kwargs from pickles and submit the jobs
* Any other code and notes I had there.

## What to do: 

1. Restart the shell script
2. Make internal hyperlinks
3. Plan every part of the rest of this
4. Rewrite the missing code
5. Push to GitHub regularly
6. Change Step 1 so that it does not tarball its outputs.

# What does the script still need to do?

1. Restore the missing old parts, up to submitting the segmentation jobs
2. Check every job ID for missing outputs, resubmit the failed jobs, and then overwrite the failed IDs
   * This probably means copying the successful IDs over, adding the newly submitted IDs to that list, and saving the list to the same file
3. Move all the segmented tiles into one directory and tarball them
   * As before: gather the output paths for each tile and save them to a .pkl, then move the files en masse to a new directory
   * Then tarball them all
4. Write a script to gather all the tarball paths and make kwargs from them
5. Submit these all as jobs to Step 3
6. Gather the outputs and move them, again
7. Finally, submit everything that's left to Step 4
8. Gather the Step 4 outputs and then make the CSV, probably on our server
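
The bookkeeping in items 2 and 3 could look something like the sketch below. It makes several assumptions: `overwrite_job_ids` and `gather_and_tar` are hypothetical names, job IDs are stored as a pickled list (as in the job ID files above), and files are copied rather than moved so that nothing is lost if a step fails partway. The demo runs on throwaway files in a temporary directory, not on real job outputs.

```python
import pickle
import shutil
import tarfile
import tempfile
from pathlib import Path
from typing import Iterable, List


def overwrite_job_ids(ids_file: Path, failed_ids: Iterable[str],
                      resubmitted_ids: Iterable[str]) -> List[str]:
    """Replace failed job IDs in a pickled list with their resubmitted IDs."""
    with open(ids_file, "rb") as f:
        job_ids = pickle.load(f)
    failed = set(failed_ids)
    # Keep the successful IDs in their original order, then append the new ones.
    merged = [job_id for job_id in job_ids if job_id not in failed]
    merged.extend(resubmitted_ids)
    with open(ids_file, "wb") as f:
        pickle.dump(merged, f)
    return merged


def gather_and_tar(paths: Iterable[Path], dest_dir: Path, tar_path: Path) -> Path:
    """Copy files into dest_dir, then pack that directory into a .tar.gz."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    for path in paths:
        shutil.copy(path, dest_dir)
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(dest_dir, arcname=dest_dir.name)
    return tar_path


# Demo on placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    ids_file = tmp_dir / "tile_job_ids.pkl"
    with open(ids_file, "wb") as f:
        pickle.dump(["a", "b", "c"], f)
    merged = overwrite_job_ids(ids_file, failed_ids=["b"], resubmitted_ids=["d"])

    rds = tmp_dir / "tile_pc_1.rds"
    rds.write_bytes(b"placeholder")
    tar_path = gather_and_tar([rds], tmp_dir / "segmented",
                              tmp_dir / "segmented.tar.gz")
    with tarfile.open(tar_path) as tar:
        members = tar.getnames()

print(merged)
print(members)
```

Writing the merged list back to the same file keeps one job ID file per tile, so the later gathering steps do not need to know that any jobs were resubmitted.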