# Use Data from the OSDF to Run Jobs

In this section of the tutorial, we move from analyzing one file at a time (accessing the data with the Pelican CLI or PelicanFS) to analyzing many files. We will run the analysis as an OSPool workload integrated with the OSDF/Pelican.

## Starting Data File

We will use the station list for this example: 

```
osdf:///aws-opendata/us-east-1/noaa-ghcn-pds/ghcnd-stations.txt
```

Download it as shown in the [command line client notebook](01-get-and-share-objects.ipynb). 

## Scenario: a List of Jobs

Suppose we want to run our analysis on each station. How many stations are there, again?

In [None]:
wc -l ghcnd-stations.txt

That's a long list of tasks to run!

Luckily, this workload profile (a list of jobs) is a perfect fit for execution via an HTCondor Access Point on a system like the Open Science Pool. To define this workload, all we need is a list and a job template.

We could use the whole `ghcnd-stations.txt` file as our list, but for simplicity, we'll cut it down to 10 stations: the 10 lines ending at line 126040, keeping only the station ID in the first column.

In [None]:
head -n 126040 ghcnd-stations.txt | tail -n 10 | cut -d " " -f 1 > station_list.txt 
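
The same selection can be sketched in Python, which may be easier to adapt than the shell pipeline. This is a minimal sketch, assuming (as the `cut` command above does) that the station ID is the first space-delimited field on each line. The function name `select_station_ids` and the sample lines are made up for illustration and are not real station records.

```python
def select_station_ids(lines, end, count):
    """Return the first field of the `count` lines that end at line `end` (1-indexed)."""
    if count <= 0:
        return []
    # Equivalent of `head -n END | tail -n COUNT`.
    window = lines[:end][-count:]
    # Equivalent of `cut -d " " -f 1`; blank lines are skipped.
    return [line.split(" ")[0] for line in window if line.strip()]


# Made-up sample lines standing in for ghcnd-stations.txt.
sample = [
    "STATION0001  10.0000   20.0000  100.0 EXAMPLE ONE",
    "STATION0002  11.0000   21.0000  110.0 EXAMPLE TWO",
    "STATION0003  12.0000   22.0000  120.0 EXAMPLE THREE",
    "STATION0004  13.0000   23.0000  130.0 EXAMPLE FOUR",
    "STATION0005  14.0000   24.0000  140.0 EXAMPLE FIVE",
]

ids = select_station_ids(sample, end=4, count=2)
print(ids)
```

For the real file, one would read the lines of `ghcnd-stations.txt`, call the function with `end=126040` and `count=10`, and write the resulting IDs to `station_list.txt`, one per line.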

## Job Template

The following information needs to be communicated in the HTCondor job file: 

- **Software environment** 
    - The job needs to bring along a software environment with the needed dependencies (Python, pandas, matplotlib).
    - In our example, we will use an existing container with these tools installed (that also happens to be available via the OSDF).
- **What the job should run**
    - The command to be executed is listed in the `executable` and `arguments` lines of the submit file. 
    - For our example, the executable is the `example.py` script and the argument is the station ID. 
- **Inputs (both scripts and data)**
    - All the inputs needed by the executable must also be transferred with the jobs. 
    - We need to include both the helper script for the code and the Pelican URL to the data file. **HTCondor has its own code that is able to leverage the Pelican client when an input file takes the form of a Pelican URL.**
- **Recording information about the job**
    - As with many other schedulers, HTCondor provides options for recording the standard output and error 
    of a running job. Note below that these files are organized into their own directory. 
- **Resource needs**
    - Resource requests that should be set for every HTCondor job include CPU cores, memory (RAM), and local disk on the execution point. 
    - For this example, we will request 1 core, 4 GB of RAM, and 4 GB of disk. 

Each of these items is reflected in the example submit file. Taken together, the lines of the submit file (except the last one) should be thought of as the template for one job. Wherever a value will differ from job to job, we've placed a variable as a placeholder, written in the form `$(variable_name)`. 
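
To make the mapping from the items above to submit file lines concrete, here is a hedged sketch that assembles a submit description as text. It is not the content of the tutorial's `example.sub`. The submit commands used (`container_image`, `executable`, `arguments`, `transfer_input_files`, `output`, `error`, `log`, `request_cpus`, `request_memory`, `request_disk`) are standard HTCondor commands, but the container path, the `helper.py` name, the input URL, and the `logs/` file names are placeholders assumed for illustration.

```python
# Hypothetical values: these are NOT the contents of the tutorial's example.sub.
submit_commands = {
    # Software environment (placeholder path, not a real container).
    "container_image": "osdf:///<namespace>/<path>/container.sif",
    # What the job should run.
    "executable": "example.py",
    "arguments": "$(station_id)",
    # Inputs: a helper script plus a Pelican URL (both placeholders).
    "transfer_input_files": "helper.py, osdf:///<namespace>/<path>/$(station_id).csv",
    # Recording information about the job, kept in its own directory.
    "output": "logs/$(station_id).out",
    "error": "logs/$(station_id).err",
    "log": "logs/jobs.log",
    # Resource needs from the text: 1 core, 4 GB of RAM, 4 GB of disk.
    "request_cpus": "1",
    "request_memory": "4GB",
    "request_disk": "4GB",
}


def render_submit(commands, queue_statement):
    """Render `key = value` lines followed by a blank line and the queue statement."""
    lines = [f"{key} = {value}" for key, value in commands.items()]
    lines.append("")
    lines.append(queue_statement)
    return "\n".join(lines) + "\n"


submit_text = render_submit(submit_commands, "queue station_id from station_list.txt")
print(submit_text)
```

Every line that contains `$(station_id)` is a place where the jobs differ from one another; all other lines are shared by every job in the list.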

In [None]:
cat example.sub

The last line (`queue station_id from station_list.txt`) is what transforms this example into a job list -- HTCondor 
will iterate through the items in our list and create a job for each one. 
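
The iteration described above can be modeled in a few lines of Python. This is only a simplified illustration of the idea, not how HTCondor is implemented, and HTCondor's macro language is richer than plain string replacement. The function `expand_jobs`, the two template entries, and the station IDs are hypothetical.

```python
def expand_jobs(template, station_ids, variable="station_id"):
    """Build one job description per list item by filling in $(variable)."""
    placeholder = f"$({variable})"
    return [
        {key: value.replace(placeholder, station_id) for key, value in template.items()}
        for station_id in station_ids
    ]


# A two-line stand-in for a job template, with made-up station IDs.
template = {
    "arguments": "$(station_id)",
    "output": "logs/$(station_id).out",
}
jobs = expand_jobs(template, ["STATION0001", "STATION0002"])

for job in jobs:
    print(job)
```

Each item in the list yields one job whose settings differ only where the placeholder appeared.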

## Submitting Jobs

We can now submit our list of jobs: 

In [None]:
condor_submit example.sub

Jobs can be monitored using `condor_q`: 

In [None]:
condor_q

Once the jobs have completed, our images will appear in the `results` folder. 

In [None]:
ls -lh results
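
With many jobs, it can help to check that every station produced an output rather than scanning the listing by eye. The sketch below assumes each job writes a file named `<station_id>.png`; that naming is an assumption made here for illustration, and the tutorial's actual output names may differ. The helper `missing_results` and the station IDs are hypothetical, and the demonstration runs in a temporary directory rather than the real `results` folder.

```python
import tempfile
from pathlib import Path


def missing_results(station_ids, results_dir, suffix=".png"):
    """Return the station IDs that have no `<station_id><suffix>` file in results_dir."""
    results_dir = Path(results_dir)
    return [
        station_id
        for station_id in station_ids
        if not (results_dir / f"{station_id}{suffix}").is_file()
    ]


# Demonstration in a throwaway directory with made-up station IDs.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "STATION0001.png").touch()
    missing = missing_results(["STATION0001", "STATION0002"], tmp)

print(missing)
```

For a real run, the IDs could be read with `Path("station_list.txt").read_text().split()` and checked against the `results` folder; any IDs returned would be candidates for inspecting the matching job's standard error file.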