# Get and Share Objects with the Pelican Client

TODO: Summary

### Objectives

- define each component of a Pelican / OSDF URL (markdown text)
- list the key verbs for interacting with objects (markdown text)
- apply knowledge to get an object, create an output, and put it somewhere.
    - randomly assign a URL to each person via * magic *
    - have them get a file
    - run the script that visualizes the data
    - put the resulting file back into a common origin
    - use pelican ls
    - pull someone else's visualization + look at it. 

## Our Data

The data we'll be working with today is the [NOAA Global Historical Climatology Network](https://www.ncei.noaa.gov/metadata/geoportal/rest/metadata/item/gov.noaa.ncdc:C00861/html) dataset. From the [README](https://docs.opendata.aws/noaa-ghcn-pds/readme.html): 

> GHCN-Daily is a dataset that contains daily observations over global land areas. It contains station-based measurements from land-based stations worldwide, about two thirds of which are for precipitation measurements only (Menne et al., 2012). GHCN-Daily is a composite of climate records from numerous sources that were merged together and subjected to a common suite of quality assurance reviews (Durre et al., 2010). 


The GHCN dataset is available from Amazon S3 (AWS) at:

```
https://noaa-ghcn-pds.s3.amazonaws.com/
```

The OSDF is already connected to AWS under the `/aws-opendata` prefix, so we will be able to access this data via Pelican and the OSDF. 

## Get Data Objects

To access an object in the OSDF, we need to construct a URL. This URL has two pieces: the **namespace prefix**, followed by the **object name** (the object's path within that namespace).

TODO: add Andrew's image of URL construction. 

To access the data in this bucket via the OSDF, we need to know the "namespace prefix" of the dataset within the OSDF.

As mentioned, the OSDF exposes AWS data under the `/aws-opendata` namespace prefix. The GHCN
website shows that the data is in the "US East 1" region of AWS, so we extend the OSDF namespace prefix to
`/aws-opendata/us-east-1`.
The S3 link above shows that the bucket holding the GHCN dataset is named `noaa-ghcn-pds`, so the full namespace prefix of the
dataset in the OSDF is `/aws-opendata/us-east-1/noaa-ghcn-pds/`.

We can't (currently) list the objects in this location, but you can browse the AWS index link 
([https://noaa-ghcn-pds.s3.amazonaws.com/](https://noaa-ghcn-pds.s3.amazonaws.com/)) to see the files available.

In the top "level" of the dataset are several readme files.
Let's get the list of stations that are contained in the dataset, so we can identify what files we want to download.

The file `ghcnd-stations.txt` contains the desired list. 
This is the "object name" that we want to fetch using the OSDF.
We combine the "namespace prefix" and the "object name" together to get the desired OSDF link:

```
osdf:///aws-opendata/us-east-1/noaa-ghcn-pds/ghcnd-stations.txt
```
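
The same construction can be written as a short Python sketch. The helper name `osdf_url` is hypothetical and is not part of the Pelican client; it only joins a namespace prefix and an object name in the way described above.

```python
# Hypothetical helper, for illustration only: it joins a namespace prefix and
# an object name into an osdf:// URL as described in the text.

NAMESPACE_PREFIX = "/aws-opendata/us-east-1/noaa-ghcn-pds"


def osdf_url(namespace_prefix: str, object_name: str) -> str:
    """Join a namespace prefix and an object name into an OSDF URL."""
    # Normalize slashes so the two pieces are joined by exactly one "/".
    prefix = "/" + namespace_prefix.strip("/")
    name = object_name.lstrip("/")
    return f"osdf://{prefix}/{name}"


print(osdf_url(NAMESPACE_PREFIX, "ghcnd-stations.txt"))
```

The three slashes after `osdf:` come from the scheme separator (`://`) followed by the leading slash of the namespace prefix.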

To download the file, we use the Pelican client with the OSDF URL:


In [5]:
pelican object get osdf:///aws-opendata/us-east-1/noaa-ghcn-pds/ghcnd-stations.txt ./

ghcnd-stations.txt 0.00 b / 10.50 MiB [--------------------------] 0s ] 0.00 b/s

Once downloaded, we can view the contents: 

In [None]:
head ghcnd-stations.txt

### Download specific station data

Next we will download all the data for a specific station. To do this, we'll need the station ID -- 
the first field in each record of the `ghcnd-stations.txt` file. 

There are a lot of stations listed (over 120,000!). For this example, we'll use the airport in Madison, WI. The
record for that station is:

```
USW00014837  43.1406  -89.3453  261.8 WI MADISON DANE CO RGNL AP                72641
```

So we'll be using station ID `USW00014837`.
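
With this many records, searching the file by name is easier than scrolling through it. The following is a minimal sketch, assuming only what is stated above: that the station ID is the first field of each record. The function name `find_stations` is hypothetical, and the sample record is the one shown above.

```python
from typing import Iterable, Iterator, Tuple


def find_stations(lines: Iterable[str], name_fragment: str) -> Iterator[Tuple[str, str]]:
    """Yield (station_id, record) for every record containing name_fragment."""
    needle = name_fragment.upper()
    for line in lines:
        record = line.rstrip("\n")
        if not record.strip():
            continue  # skip blank lines
        if needle in record.upper():
            # The station ID is the first whitespace-delimited field.
            yield record.split()[0], record


# The record shown above, used here so the sketch runs without the full file.
sample = [
    "USW00014837  43.1406  -89.3453  261.8 WI MADISON DANE CO RGNL AP                72641",
]

for station_id, _record in find_stations(sample, "madison dane"):
    print(station_id)

# With the downloaded file, the same search would look like:
#   with open("ghcnd-stations.txt") as stations:
#       for station_id, record in find_stations(stations, "MADISON"):
#           print(station_id, "|", record)
```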

Once again, we need to construct our URL. The namespace prefix hasn't changed, but the
station data lives under the path `csv/by_station`, and each filename has the form `<STATION ID>.csv`.

Putting these pieces together gives:

```
osdf:///aws-opendata/us-east-1/noaa-ghcn-pds/csv/by_station/USW00014837.csv
```

In [None]:
pelican object get osdf:///aws-opendata/us-east-1/noaa-ghcn-pds/csv/by_station/USW00014837.csv ./
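
If you would rather drive the download from a Python script, one option is to call the same command through `subprocess`. This is a sketch under the assumption that the `pelican` executable is on your `PATH`; the function names `pelican_get_command` and `fetch` are hypothetical wrappers, not part of Pelican.

```python
import shutil
import subprocess
from typing import List


def pelican_get_command(url: str, destination: str = "./") -> List[str]:
    """Build the argument list for `pelican object get <url> <destination>`."""
    return ["pelican", "object", "get", url, destination]


def fetch(url: str, destination: str = "./") -> None:
    """Run the download; raises CalledProcessError if the client exits non-zero."""
    if shutil.which("pelican") is None:
        raise FileNotFoundError("the pelican client was not found on PATH")
    subprocess.run(pelican_get_command(url, destination), check=True)


station_url = "osdf:///aws-opendata/us-east-1/noaa-ghcn-pds/csv/by_station/USW00014837.csv"

# Print the command that would be run; call fetch(station_url) to run it.
print(" ".join(pelican_get_command(station_url)))
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems if a URL or destination ever contains spaces.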

And we can again view the contents of the file:

In [2]:
head USW00014837.csv

ID,DATE,ELEMENT,DATA_VALUE,M_FLAG,Q_FLAG,S_FLAG,OBS_TIME
USW00014837,19391001,TMAX,194,,,X,
USW00014837,19391002,TMAX,211,,,X,
USW00014837,19391003,TMAX,233,,,X,
USW00014837,19391004,TMAX,272,,,X,
USW00014837,19391005,TMAX,211,,,X,
USW00014837,19391006,TMAX,250,,,X,
USW00014837,19391007,TMAX,294,,,X,
USW00014837,19391008,TMAX,261,,,X,
USW00014837,19391009,TMAX,239,,,X,
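
To give a feel for how these records can be read programmatically, here is a minimal sketch using only the Python standard library. It is not the `scripts/example.py` used later in this tutorial. It assumes, per the GHCN-Daily documentation, that temperature values are stored in tenths of degrees Celsius; the function name `read_element` is hypothetical, and the sample rows are copied from the output above.

```python
import csv
import io
from datetime import datetime

# A few rows copied from the output above, so the sketch runs on its own.
SAMPLE = """\
ID,DATE,ELEMENT,DATA_VALUE,M_FLAG,Q_FLAG,S_FLAG,OBS_TIME
USW00014837,19391001,TMAX,194,,,X,
USW00014837,19391002,TMAX,211,,,X,
USW00014837,19391003,TMAX,233,,,X,
"""


def read_element(handle, element):
    """Return {date: value in degrees Celsius} for one ELEMENT (e.g. "TMAX")."""
    values = {}
    for row in csv.DictReader(handle):
        if row["ELEMENT"] != element:
            continue
        day = datetime.strptime(row["DATE"], "%Y%m%d").date()
        # Assumption: DATA_VALUE is in tenths of degrees Celsius.
        values[day] = int(row["DATA_VALUE"]) / 10.0
    return values


tmax = read_element(io.StringIO(SAMPLE), "TMAX")
for day, celsius in sorted(tmax.items()):
    print(day.isoformat(), f"{celsius:.1f} C")

# With the downloaded file:
#   with open("USW00014837.csv", newline="") as handle:
#       tmax = read_element(handle, "TMAX")
```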


## Share Data Objects

Let's visualize the data we just downloaded and share our results via the OSDF. 

In [3]:
python3 scripts/example.py USW00014837

ELEMENT          TMIN          TMAX
count    21701.000000  21701.000000
mean        35.611256     56.267353
std         20.147128     22.528634
min        -36.940000    -14.080000
25%         21.920000     37.040000
50%         35.960000     57.920000
75%         51.980000     75.920000
max         82.940000    104.000000


Plotting histograms of observations for 21,701 days, spanning 59.4 years 
from 1939-10-01 to 1999-02-28, to 'USW00014837.png' .



This should produce a plot: 

![](USW00014837.png)

These results can be shared using our own local origin. 

As before, the first step is to construct the URL where we want to place the data. The namespace prefix is `TBD`. To avoid collisions, please include your username in the destination object name (for example, `ckoch-USW00014837.png`) so that your upload does not overwrite someone else's.

In [4]:
pelican object put USW00014837.png osdf:///tutorial-origin.osdf-dev.chtc.io/test/ckoch-USW00014837.png

ERROR[2025-01-28T14:09:42-06:00] error while querying the director at https://osdf-director.osg-htc.org: 404: No namespace found for path. Either it doesn't exist, or the Director is experiencing problems
ERROR[2025-01-28T14:09:42-06:00] Failure putting USW00014837.png: failed to get namespace information for remote URL osdf:///tutorial-origin.osdf-dev.chtc.io/test/ckoch-USW00014837.png: error while querying the director at https://osdf-director.osg-htc.org: 404: No namespace found for path. Either it doesn't exist, or the Director is experiencing problems
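
The destination name used in the command above, a personal prefix followed by the original file name, can be generated rather than typed by hand. This is a sketch only: `destination_url` is a hypothetical helper, and `/example-namespace` is a placeholder rather than a real OSDF namespace, since the tutorial's namespace prefix has not been decided yet.

```python
from pathlib import PurePosixPath


def destination_url(namespace_prefix: str, username: str, local_file: str) -> str:
    """Build osdf://<prefix>/<username>-<file name> for an upload."""
    prefix = "/" + namespace_prefix.strip("/")
    # Keep only the file name, then prepend the username to avoid collisions.
    object_name = f"{username}-{PurePosixPath(local_file).name}"
    return f"osdf://{prefix}/{object_name}"


# "/example-namespace" is a placeholder, not a real OSDF namespace.
print(destination_url("/example-namespace", "ckoch", "USW00014837.png"))
```

The resulting URL would then be passed as the second argument to `pelican object put`, in the same position as in the command above.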

## List Data Objects

TODO?